Show HN: Zerox – document OCR with GPT-mini
Zerox OCR is a tool on GitHub for Optical Character Recognition (OCR) of documents in AI applications. The repository covers the tool's approach, a pricing comparison with other OCR services, installation instructions, and usage examples.
Related
Open-Source Perplexity – Omniplex
The Omniplex open-source project on GitHub focuses on core functionality, plugin development, and multi-LLM support. It is built with TypeScript, React, Redux, Next.js, and Firebase, and integrates with services like OpenAI. Community contributions are welcomed.
Show HN: Xcapture-BPF – like Linux top, but with Xray vision
0x.tools simplifies Linux application performance analysis without requiring upgrades or heavy frameworks. It offers thread monitoring, CPU usage tracking, system call analysis, and kernel wait location identification. The xcapture-bpf tool enhances performance data visualization through eBPF. Installation guides are available for RHEL 8.1 and Ubuntu 24.04.
Dotenvx: A better dotenv – from the creator of `dotenv`
The GitHub repository for dotenvx offers detailed documentation covering features, installation, quickstart, advanced usage, examples, platform specifics, FAQs, and contribution guidelines, aiding users in effectively utilizing dotenvx.
Choose your own adventure style Incident Response
Command Zero is an autonomous platform for cyber investigations, offering threat hunting, identity-based investigations, and expert content to streamline operations and enhance security. It has received praise for empowering teams and reducing risks.
Oxidize – Notes on moving Harfbuzz and Freetype tools and libraries to Rust
The "oxidize" project on GitHub aims to migrate font tooling tasks such as shaping, rasterization, font compilation, and manipulation from Python and C++ to Rust. It outlines objectives, priorities, and references.
- Users discuss the accuracy and pricing of different OCR solutions, comparing Zerox OCR with Azure and Gemini models.
- There are concerns about the naming of Zerox OCR due to potential trademark issues with the Xerox company.
- Several commenters share their experiences with OCR tools, highlighting challenges with complex layouts and the need for human review in some cases.
- Suggestions for improving OCR accuracy include using confidence scores and separating OCR processing from formatting tasks.
- Some users express interest in local OCR solutions like Tesseract, questioning the need for paid services.
Here’s our pricing comparison:
*Gemini Pro*
- $0.66 per 1k image inputs (batch)
- $1.88 per 1k output tokens (batch API)
- 395 pages per dollar

*Gemini Flash*
- $0.066 per 1k image inputs (batch)
- $0.53 per 1k output tokens (batch API)
- 1693 pages per dollar

*GPT-4o*
- $1.91 per 1k image inputs (batch)
- $3.75 per 1k output tokens (batch API)
- 177 pages per dollar

*GPT-4o-mini*
- $1.91 per 1k image inputs (batch)
- $0.30 per 1k output tokens (batch API)
- 452 pages per dollar
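These pages-per-dollar figures appear to assume roughly 1k output tokens per page: for GPT-4o-mini, $1.91/1000 + $0.30/1000 ≈ $0.00221 per page, which works out to about 452 pages per dollar; the same arithmetic gives ~177 pages per dollar for GPT-4o ($0.00191 + $0.00375 ≈ $0.00566 per page).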
[1] https://community.openai.com/t/super-high-token-usage-with-g...
1. Prompt with examples. I included an example image with an example transcription as part of the prompt. This made GPT make fewer mistakes and improved output accuracy.
2. Confidence score. I extracted the embedded text from the PDF and compared the frequency of character triples in the source text and GPT’s output. If there was a significant difference (less than 90% overlap) I would log a warning. This helped detect cases when GPT omitted entire paragraphs of text.
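A minimal sketch of that trigram check (the helper names and placeholder inputs are hypothetical; the 90% threshold comes from the comment above):

// Sketch of the trigram-overlap confidence check described above.
// Counts character triples in the PDF's embedded text and measures how many
// also appear in GPT's output; below 90% overlap, log a warning.
function trigramCounts(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  const normalized = text.toLowerCase().replace(/\s+/g, " ");
  for (let i = 0; i + 3 <= normalized.length; i++) {
    const tri = normalized.slice(i, i + 3);
    counts.set(tri, (counts.get(tri) ?? 0) + 1);
  }
  return counts;
}

function trigramOverlap(sourceText: string, ocrText: string): number {
  const source = trigramCounts(sourceText);
  const ocr = trigramCounts(ocrText);
  let shared = 0;
  let total = 0;
  for (const [tri, count] of source) {
    total += count;
    shared += Math.min(count, ocr.get(tri) ?? 0);
  }
  return total === 0 ? 1 : shared / total;
}

// Placeholders for the PDF's embedded text and the model's transcription.
declare const pdfEmbeddedText: string;
declare const gptOutput: string;

if (trigramOverlap(pdfEmbeddedText, gptOutput) < 0.9) {
  console.warn("Low trigram overlap: GPT may have omitted text");
}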
The $10/1000 pages model includes layout detection (headers, etc.) as well as key-value pairs and checkbox detection.
I have continued to do proofs of concept with Gemini and GPT, and in general with any new multimodal model that comes out, but so far none is on par with Azure's checkbox detection.
In fact, the results from Gemini/GPT-4 aren't even good enough to use as a teacher for distilling a "small" multimodal model specializing in layout/checkbox detection.
I would also like to shout out Surya OCR, which is up and coming. It's source-available and free below a certain funding or revenue milestone - I think $5M. It doesn't have word-level detection yet, but it's one of the more promising OCR tools outside the hyperscalers and heavyweight commercial vendors that I'm aware of.
const systemPrompt = `
Convert the following PDF page to markdown.
Return only the markdown with no explanation text.
Do not exclude any content from the page.
`;
For each subsequent page:
messages.push({
role: "system",
content: `Markdown must maintain consistent formatting with the following page: \n\n """${priorPage}"""`,
});

Could be handy for general-purpose frontend tools.
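Putting those pieces together, a hypothetical end-to-end loop might look like this (renderPdfToImages and callModel are declared stand-ins for the real rendering and completion calls; this is a sketch of the chaining idea, not zerox's actual internals):

// Stand-ins for PDF-to-image rendering and the chat-completion call.
declare function renderPdfToImages(pdfPath: string): Promise<string[]>;
declare function callModel(messages: object[]): Promise<string>;

const systemPrompt = `
Convert the following PDF page to markdown.
Return only the markdown with no explanation text.
Do not exclude any content from the page.
`;

async function ocrPdf(pdfPath: string): Promise<string[]> {
  const pages: string[] = [];
  let priorPage = "";

  for (const imageUrl of await renderPdfToImages(pdfPath)) {
    const messages: object[] = [{ role: "system", content: systemPrompt }];

    // Anchor each page to the previous one so headings, lists, and tables
    // keep a consistent markdown style across page boundaries.
    if (priorPage) {
      messages.push({
        role: "system",
        content: `Markdown must maintain consistent formatting with the following page: \n\n """${priorPage}"""`,
      });
    }

    messages.push({
      role: "user",
      content: [{ type: "image_url", image_url: { url: imageUrl } }],
    });

    priorPage = await callModel(messages);
    pages.push(priorPage);
  }

  return pages;
}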
Are you supporting the Batch API from OpenAI? This would lower costs by 50%. Many OCR tasks are not time-sensitive, so this might be a very good tradeoff.
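For reference, a minimal sketch of what batch submission looks like with the OpenAI Node SDK (the file name and request contents are placeholders; zerox itself may not support this today):

import fs from "node:fs";
import OpenAI from "openai";

async function submitOcrBatch(): Promise<void> {
  const client = new OpenAI();

  // Each line of the JSONL file is one chat-completion request, e.g.:
  // {"custom_id":"page-1","method":"POST","url":"/v1/chat/completions","body":{...}}
  const file = await client.files.create({
    file: fs.createReadStream("ocr_requests.jsonl"),
    purpose: "batch",
  });

  // Batch requests are billed at roughly half the synchronous price and
  // complete within the stated window, which suits non-urgent OCR jobs.
  const batch = await client.batches.create({
    input_file_id: file.id,
    endpoint: "/v1/chat/completions",
    completion_window: "24h",
  });

  console.log(`Submitted batch ${batch.id}, status: ${batch.status}`);
}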
Check it out, https://cluttr.ai
Runs entirely in the browser, using OPFS + WASM.
Try with any PDF document in the playground - https://pg.llmwhisperer.unstract.com/
You are an expert in PDFs. You are helping a user extract text from a PDF.
Extract the text from the image as a structured json output.
Extract the data using the following schema:
{Page.model_json_schema()}
Example:
{{
"title": "Title",
"page_number": 1,
"sections": [
...
],
"figures": [
...
]
}}
https://binal.pub/2023/12/structured-ocr-with-gpt-vision/

This will perform worse in cases where whatever understanding the large model has of the contents is needed to recognize indistinct symbols. But it will avoid cases where that very same understanding causes contents to be understood incorrectly due to the model’s assumptions of what the contents should be.
At least in my limited experiments with Claude, it’s easy for models to lose track of where they’re looking on the page and to omit things entirely. But if segmentation of the page is explicit, one can enforce that all contents end up in exactly one segment.
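For comparison, a rough TypeScript analogue of the schema-driven prompt shown earlier (the original is presumably Python/pydantic, given Page.model_json_schema(); the field shapes here are guesses from the example):

import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";

// Guessed analogue of the pydantic `Page` model; the real field types
// aren't shown in the comment above.
const Page = z.object({
  title: z.string(),
  page_number: z.number(),
  sections: z.array(z.string()),
  figures: z.array(z.string()),
});

const prompt = `
You are an expert in PDFs. You are helping a user extract text from a PDF.
Extract the text from the image as a structured json output.
Extract the data using the following schema:

${JSON.stringify(zodToJsonSchema(Page), null, 2)}
`;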
So far, I have collected over 500 receipts from around 10 countries, covering 30 different supermarkets and 5 different languages.
What has worked for me so far is having control over OCR and processing (for formatting/structuring) separately. I don't have the figures to provide a cost structure, but I'm looking for other solutions to improve both speed and accuracy. Also, I need to figure out a way to put a metric around accuracy. I will definitely give this a shot. Thanks a lot.
Why did you choose markdown? Did you try other output formats and see if you get better results?
Also, I wonder how HTML performs. It would be a way to handle tables with groupings/merged cells.
Most of the comments are about pricing and quality.
Also, are there any developments in product detection? These days I'm looking for solutions that can recognize goods with high accuracy and return [brand][product_name][variant].
I like the optimism.
I've needed to include human review with previous-generation OCR software when I needed the results to be accurate. It's painstaking, but the OCR offered a speedup over fully manual transcription. Have you given any thought to human-in-the-loop processes?
I would love to use this in a project if it could also caption embedded images to produce something for RAG...
Just
maintainFormat: true
did not seem to have any effect in my testing.