Open-source tool helps you convert PDF documents, web pages, etc., into Markdown
MinerU is a tool that converts PDFs into machine-readable formats like markdown and JSON, enhancing information extraction while preserving document structure and readability across multiple operating systems.
MinerU is a tool for converting PDFs into machine-readable formats such as markdown and JSON, aimed at improving information extraction from scientific literature and other documents. It addresses common problems with symbol conversion and improves the readability of the extracted content. Key features include removing headers, footers, and page numbers while preserving semantic continuity; outputting the text of multi-column documents in logical reading order; and maintaining the original document structure, including titles and lists. MinerU can also extract images, tables, and captions, and convert formulas to LaTeX. It runs in both CPU and GPU environments on Windows, Linux, and macOS. To get started, users create a conda environment for the installation, download the model weight files, and configure the template file. The tool can then be run from the command line or called through its API. An online demo lets users try it without installing anything. The project is open source and invites contributions and feedback through its GitHub issues page.
- MinerU converts PDFs to markdown and JSON for easier information extraction.
- It removes non-essential elements while preserving document structure and readability.
- The tool supports multiple operating systems and both CPU and GPU environments.
- Users can quickly set up and use MinerU through a conda environment.
- An online demo is available for testing the tool without installation.
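To make the header and footer removal concrete, here is a minimal Python sketch of one simple heuristic: a line that recurs on most pages, once digits are masked, is probably a running header, footer, or page number. This is an illustration only and is not taken from MinerU, whose actual method is not described here. The function names `normalize` and `strip_repeated_lines`, the `min_share` threshold, and the sample pages are all hypothetical.

```python
import re
from collections import Counter


def normalize(line: str) -> str:
    """Mask digits and case so 'Page 3' and 'Page 12' compare equal."""
    return re.sub(r"\d+", "#", line.strip().lower())


def strip_repeated_lines(pages: list[list[str]], min_share: float = 0.6) -> list[list[str]]:
    """Drop lines whose normalized form occurs on at least min_share of the pages.

    Assumption for illustration: each page is already a list of text lines.
    """
    counts = Counter()
    for page in pages:
        # Count each normalized line at most once per page.
        counts.update({normalize(line) for line in page if line.strip()})
    # Require at least two pages so a single-page document is left untouched.
    threshold = max(2, min_share * len(pages))
    repeated = {key for key, n in counts.items() if n >= threshold}
    return [[line for line in page if normalize(line) not in repeated] for page in pages]


# Hypothetical three-page document with a running header and page numbers.
pages = [
    ["Journal of Examples", "Deep learning has changed", "document parsing.", "1"],
    ["Journal of Examples", "Layout models detect", "columns and tables.", "2"],
    ["Journal of Examples", "Formulas become LaTeX.", "3"],
]
cleaned = strip_repeated_lines(pages)
body = " ".join(line for page in cleaned for line in page)
print(body)
```

Because the header and page numbers are dropped before the pages are joined, sentences that span a page break read continuously. A heuristic this simple can also delete body lines that happen to repeat on most pages, which is one reason real tools rely on layout information rather than text matching alone.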
Related
NuExtract: A LLM for Structured Extraction
NuExtract is a structure extraction model by NuMind, offering tiny and large versions. NuMind also provides NuNER Zero and sentiment analysis models. Mistral 7B, by Mistral AI, excels in benchmarks with innovative attention mechanisms.
Diff-pdf: tool to visually compare two PDFs
The GitHub repository offers "diff-pdf," a tool for visually comparing PDF files. Users can highlight differences in an enhanced PDF or use a GUI. Precompiled versions are available for various systems, with installation instructions.
Show HN: DOM to Semantic Markdown – For LLMs
The GitHub repository hosts "DOM to Semantic Markdown," converting HTML to Semantic Markdown for Large Language Models. It features AST conversion, main content detection, metadata capture, URL shortening, and npm support.
Show HN: LLM Aided OCR (Correcting Tesseract OCR Errors with LLMs)
The LLM-Aided OCR Project enhances Optical Character Recognition by integrating natural language processing and large language models, producing accurate documents from raw OCR text and supporting local and cloud-based LLMs.
Show HN: Index and search *all* your documents
The doc-parser-searcher GitHub repository offers a tool for indexing and searching documents using Apache Lucene and Tika, featuring OCR capabilities, customizable settings, and requiring Java 18 for operation.