August 17th, 2024

Open-source tool helps you convert PDF documents, web pages, etc., into Markdown

MinerU is a tool that converts PDFs into machine-readable formats such as Markdown and JSON, making information easier to extract while preserving document structure and readability. It runs on multiple operating systems.

MinerU is a tool designed to convert PDFs into machine-readable formats such as Markdown and JSON, with the aim of improving information extraction from scientific literature and other documents. It addresses common challenges related to symbol conversion and improves the readability of the extracted content.

Key features include removing headers, footers, and page numbers while preserving semantic continuity; outputting text from multi-column documents in a logical reading order; and maintaining the original document structure, including titles and lists. MinerU can also extract images, tables, and captions, and convert formulas to LaTeX.

The tool is compatible with both CPU and GPU environments and works on Windows, Linux, and macOS. To get started, users create a conda environment for installation, download the model weight files, and configure the template file. The tool can then be run from the command line or accessed through its API. An online demo lets users test the tool without installing it. The project is open-source and encourages contributions and feedback through its GitHub issues page.
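To make the header, footer, and page-number removal described above more concrete, here is a minimal sketch of one common heuristic: treat a line as boilerplate if it recurs on most pages. This is an illustration only, not MinerU's actual method; the function name `strip_repeated_lines`, the `min_fraction` parameter, and the sample pages are all hypothetical.

```python
import re
from collections import Counter


def strip_repeated_lines(pages, min_fraction=0.6):
    """Drop lines that recur on most pages (likely headers, footers, page numbers).

    `pages` is a list of pages, each a list of text lines.
    Hypothetical helper for illustration; not part of MinerU.
    """
    def normalize(line):
        # Collapse digit runs so "Page 3" and "Page 4" count as the same line.
        return re.sub(r"\d+", "#", line.strip())

    counts = Counter()
    for page in pages:
        # Count each normalized line at most once per page.
        counts.update({normalize(line) for line in page if line.strip()})

    # A line must appear on at least two pages to be treated as boilerplate.
    threshold = max(2, int(min_fraction * len(pages)))
    repeated = {key for key, n in counts.items() if n >= threshold}

    return [
        [line for line in page if line.strip() and normalize(line) not in repeated]
        for page in pages
    ]


pages = [
    ["Journal of Examples", "Intro text on page one.", "1"],
    ["Journal of Examples", "More body text.", "2"],
    ["Journal of Examples", "Closing remarks.", "3"],
]
print(strip_repeated_lines(pages))
```

A frequency heuristic like this can misfire on body lines that legitimately repeat, so a real pipeline would likely also use position on the page and layout information.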

- MinerU converts PDFs to Markdown and JSON for easier information extraction.

- It removes non-essential elements while preserving document structure and readability.

- The tool supports multiple operating systems and both CPU and GPU environments.

- Users can quickly set up and use MinerU through a conda environment.

- An online demo is available for testing the tool without installation.
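The logical reading order for multi-column documents mentioned in the summary can be illustrated with a simple geometric sort. The sketch below assumes text blocks already come with bounding-box coordinates and that the page has equal-width columns; the `Block` type and `reading_order` function are hypothetical names for illustration and do not reflect how MinerU does it.

```python
from dataclasses import dataclass


@dataclass
class Block:
    x0: float   # left edge of the block
    y0: float   # top edge (smaller means higher on the page)
    text: str


def reading_order(blocks, page_width, columns=2):
    """Sort blocks column by column, top to bottom within each column.

    Assumes equal-width columns; hypothetical helper, not MinerU's algorithm.
    """
    col_width = page_width / columns

    def key(block):
        # Assign the block to a column by its left edge, clamped to the last column.
        column = min(int(block.x0 // col_width), columns - 1)
        return (column, block.y0)

    return sorted(blocks, key=key)


blocks = [
    Block(310, 50, "C"),
    Block(20, 400, "B"),
    Block(20, 50, "A"),
    Block(310, 400, "D"),
]
print([b.text for b in reading_order(blocks, page_width=600)])
```

This ignores elements that span several columns, such as full-width titles or figures, which would need separate handling.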

3 comments
By @h-jones - 8 months
Anyone know how this compares to GROBID [1]? I'm looking at alternatives to GROBID as I'm not super pleased with its outputs. GROBID has a lot of great features for journal papers (reference extraction / parsing), but I'm only interested in cleanly extracting the body. Also considering nougat [2] but I haven't tried it yet.

[1] https://github.com/kermitt2/grobid

[2] https://github.com/facebookresearch/nougat

By @oliverkwebb - 8 months
Nice tool. I've been using html2md [1] and such. It's written in Python and in beta, so it's probably not the best for processing static sites and such, but still useful.

[1]: https://github.com/suntong/html2md