August 17th, 2024

Open-source tool helps you convert PDF documents, web pages, etc., into Markdown

MinerU is a tool that converts PDFs into machine-readable formats such as Markdown and JSON, improving information extraction while preserving document structure and readability. It runs on multiple operating systems.

MinerU is a tool that converts PDFs into machine-readable formats such as Markdown and JSON, with the aim of improving information extraction from scientific literature and other documents. It targets the symbol-conversion problems common in such documents and improves the readability of the extracted content. Key features include removing headers, footers, and page numbers while preserving semantic continuity; outputting text from multi-column documents in logical reading order; and preserving the original document structure, including titles and lists. MinerU can also extract images, tables, and captions, and convert formulas to LaTeX. It runs in both CPU and GPU environments on Windows, Linux, and macOS.

To get started, users create a conda environment for installation, download the model weight files, and configure the template file. The tool can then be run from the command line or called through its API. An online demo is available for trying the tool without installing it. The project is open source and invites contributions and feedback through its GitHub issues page.

- MinerU converts PDFs to Markdown and JSON for easier information extraction.

- It removes non-essential elements while preserving document structure and readability.

- The tool supports multiple operating systems and both CPU and GPU environments.

- Users can quickly set up and use MinerU through a conda environment.

- An online demo is available for testing the tool without installation.
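
To make the "removes headers, footers, and page numbers" feature concrete, here is a minimal Python sketch of one common heuristic: treat a line as a running header or footer if it recurs near the top or bottom of most pages, and drop bare page numbers found at page edges. This is an illustration only, not MinerU's actual implementation. The function name `strip_running_elements`, its parameters, and the sample pages are all hypothetical.

```python
import re
from collections import Counter

# Matches bare page numbers such as "3", "Page 3", "3 / 10", or "3 of 10".
PAGE_NUMBER = re.compile(r"^\s*(page\s+)?\d+(\s*(/|of)\s*\d+)?\s*$", re.IGNORECASE)


def strip_running_elements(pages, edge=2, min_fraction=0.6):
    """Drop repeated headers/footers and page numbers from page edges.

    pages: list of pages, each a list of text lines (hypothetical input shape).
    edge: how many lines at the top and bottom of a page count as "edge" lines.
    min_fraction: share of pages an edge line must appear on to be dropped.
    """
    # Count, per page, which stripped edge lines occur (once per page at most).
    counts = Counter()
    for lines in pages:
        candidates = {line.strip() for line in lines[:edge] + lines[-edge:]}
        counts.update(text for text in candidates if text)

    # An edge line is "running" if it appears on enough pages (at least two).
    threshold = max(2, int(len(pages) * min_fraction))
    repeated = {text for text, n in counts.items() if n >= threshold}

    cleaned = []
    for lines in pages:
        kept = []
        for i, line in enumerate(lines):
            at_edge = i < edge or i >= len(lines) - edge
            text = line.strip()
            if at_edge and (text in repeated or PAGE_NUMBER.match(text)):
                continue  # running header/footer or page number
            kept.append(line)
        cleaned.append(kept)
    return cleaned


# Hypothetical sample: three pages sharing a running header and page numbers.
sample_pages = [
    ["Journal of Examples", "Intro text continues", "across the page", "1"],
    ["Journal of Examples", "and onto the next", "page here", "2"],
    ["Journal of Examples", "final words", "the end", "3"],
]
cleaned_pages = strip_running_elements(sample_pages)
body_text = " ".join(line for page in cleaned_pages for line in page)
```

Joining the cleaned pages gives one uninterrupted run of body text, which is the "semantic continuity" the summary refers to. A real converter would likely also use layout information such as position and font size, which this text-only sketch ignores.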

Related

NuExtract: A LLM for Structured Extraction

NuExtract is a structure extraction model by NuMind, offering tiny and large versions. NuMind also provides NuNER Zero and sentiment analysis models. Mistral 7B, by Mistral AI, excels in benchmarks with innovative attention mechanisms.

Diff-pdf: tool to visually compare two PDFs

The GitHub repository offers "diff-pdf," a tool for visually comparing PDF files. Users can highlight differences in an enhanced PDF or use a GUI. Precompiled versions are available for various systems, with installation instructions.

Show HN: DOM to Semantic Markdown – For LLMs

The GitHub repository hosts "DOM to Semantic Markdown," converting HTML to Semantic Markdown for Large Language Models. It features AST conversion, main content detection, metadata capture, URL shortening, and npm support.

Show HN: LLM Aided OCR (Correcting Tesseract OCR Errors with LLMs)

The LLM-Aided OCR Project enhances Optical Character Recognition by integrating natural language processing and large language models, producing accurate documents from raw OCR text and supporting local and cloud-based LLMs.

Show HN: Index and search all your documents

The doc-parser-searcher GitHub repository offers a tool for indexing and searching documents using Apache Lucene and Tika, featuring OCR capabilities, customizable settings, and requiring Java 18 for operation.

3 comments

By @h-jones - 8 months

Anyone know how this compares to GROBID [1]? I'm looking at alternatives to GROBID as I'm not super pleased with its outputs. GROBID has a lot of great features for journal papers (reference extraction / parsing), but I'm only interested in cleanly extracting the body. Also considering nougat [2] but I haven't tried it yet.

[1] https://github.com/kermitt2/grobid

[2] https://github.com/facebookresearch/nougat

By @oliverkwebb - 8 months

Nice tool, I've been using html2md[1] and such. It's written in python and in beta so it's probably not the best for processing static sites and such. But still useful

[1]: https://github.com/suntong/html2md
