July 23rd, 2024

Show HN: Zerox – document OCR with GPT-mini

Zerox OCR is a GitHub tool that uses GPT-4o-mini to perform Optical Character Recognition (OCR) on documents for AI applications. The repository covers its functionality, pricing comparisons, installation guidance, and usage examples.

The GitHub repository describes Zerox OCR, a tool that performs document OCR by passing page images to GPT-4o-mini and asking the model to return the content as markdown for downstream AI use. The README explains the logic behind the approach, compares its pricing with other OCR services, provides installation instructions, and includes practical usage examples.

Related

Open-Source Perplexity – Omniplex

The Omniplex open-source project on GitHub focuses on core functionality, Plugins Development, and Multi-LLM Support. It utilizes TypeScript, React, Redux, Next.js, Firebase, and integrates with services like OpenAI and Firebase. Community contributions are welcomed.

Show HN: Xcapture-BPF – like Linux top, but with Xray vision

0x.tools simplifies Linux application performance analysis without requiring upgrades or heavy frameworks. It offers thread monitoring, CPU usage tracking, system call analysis, and kernel wait location identification. The xcapture-bpf tool enhances performance data visualization through eBPF. Installation guides are available for RHEL 8.1 and Ubuntu 24.04.

Dotenvx: A better dotenv – from the creator of `dotenv`

The GitHub repository for dotenvx offers detailed documentation covering features, installation, quickstart, advanced usage, examples, platform specifics, FAQs, and contribution guidelines, aiding users in effectively utilizing dotenvx.

Choose your own adventure style Incident Response

Command Zero is an autonomous platform for cyber investigations, offering threat hunting, identity-based investigations, and expert content to streamline operations and enhance security. It has received praise for empowering teams and reducing risks.

Oxidize – Notes on moving Harfbuzz and Freetype tools and libraries to Rust

The "oxidize" project on GitHub aims to migrate font tooling from Python and C++ to Rust, covering shaping, rasterization, font compilation, and manipulation. It outlines objectives, priorities, and references.

AI: What people are saying
The comments on the Zerox OCR tool reveal various insights and concerns regarding OCR technology and its applications.
  • Users discuss the accuracy and pricing of different OCR solutions, comparing Zerox OCR with Azure and Gemini models.
  • There are concerns about the naming of Zerox OCR due to potential trademark issues with the Xerox company.
  • Several commenters share their experiences with OCR tools, highlighting challenges with complex layouts and the need for human review in some cases.
  • Suggestions for improving OCR accuracy include using confidence scores and separating OCR processing from formatting tasks.
  • Some users express interest in local OCR solutions like Tesseract, questioning the need for paid services.
30 comments
By @serjester - 9 months
It should be noted that, for some reason, OpenAI prices GPT-4o-mini image requests at the same price as GPT-4o. I have a similar library, and we found OpenAI has subtle OCR inconsistencies with tables (numbers will be inaccurate). Gemini Flash, for all its faults, seems to do really well as a replacement while being significantly cheaper.

Here’s our pricing comparison:

| Model | Per 1k image inputs (batch) | Per 1k text outputs (batch, ~1k tokens each) | Pages per dollar |
|---|---|---|---|
| Gemini Pro | $0.66 | $1.88 | 395 |
| Gemini Flash | $0.066 | $0.53 | 1693 |
| GPT-4o | $1.91 | $3.75 | 177 |
| GPT-4o-mini | $1.91 | $0.30 | 452 |
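One way to sanity-check the "pages per dollar" figures (an assumption on my part: each page costs one entry from the image-input column plus one entry from the text-output column):

```python
def pages_per_dollar(image_cost_per_1k: float, output_cost_per_1k: float) -> float:
    """Pages per dollar, assuming each page consumes one image input and
    produces one ~1k-token text output, both billed at batch rates."""
    cost_per_page = (image_cost_per_1k + output_cost_per_1k) / 1000
    return 1 / cost_per_page

# Batch prices quoted above: (per 1k image inputs, per 1k text outputs)
models = {
    "Gemini Pro": (0.66, 1.88),
    "Gemini Flash": (0.066, 0.53),
    "GPT-4o": (1.91, 3.75),
    "GPT-4o-mini": (1.91, 0.30),
}
for name, (img, out) in models.items():
    print(f"{name}: {pages_per_dollar(img, out):.0f} pages per dollar")
```

Under that assumption the GPT figures match exactly (177 and 452), while the Gemini numbers come out slightly lower (394 and 1678 versus the quoted 395 and 1693), so the per-page token assumption for Gemini may differ slightly.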

[1] https://community.openai.com/t/super-high-token-usage-with-g...

[2] https://github.com/Filimoa/open-parse

By @8organicbits - 9 months
I'm surprised by the name choice, there's a large company with an almost identical name that has products that do this. May be worth changing it sooner rather than later.

https://duckduckgo.com/?q=xerox+ocr+software&t=fpas&ia=web

By @hugodutka - 9 months
I used this approach extensively over the past couple of months with GPT-4 and GPT-4o while building https://hotseatai.com. Two things that helped me:

1. Prompt with examples. I included an example image with an example transcription as part of the prompt. This made GPT make fewer mistakes and improved output accuracy.

2. Confidence score. I extracted the embedded text from the PDF and compared the frequency of character triples in the source text and GPT’s output. If there was a significant difference (less than 90% overlap) I would log a warning. This helped detect cases when GPT omitted entire paragraphs of text.
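The confidence check in point 2 can be sketched roughly like this (my own minimal illustration of the character-triple idea, not hotseatai's actual code):

```python
from collections import Counter

def trigram_overlap(source: str, ocr_output: str) -> float:
    """Fraction of the source's character-trigram occurrences that also
    appear in the OCR output (counting multiplicity)."""
    src = Counter(source[i:i + 3] for i in range(len(source) - 2))
    out = Counter(ocr_output[i:i + 3] for i in range(len(ocr_output) - 2))
    total = sum(src.values())
    if total == 0:
        return 1.0
    matched = sum(min(count, out[tri]) for tri, count in src.items())
    return matched / total

embedded_text = "The quick brown fox jumps over the lazy dog."
gpt_output = "The quick brown fox jumps over the dog."  # a phrase was dropped
if trigram_overlap(embedded_text, gpt_output) < 0.9:
    print("warning: GPT output may be missing text")
```

Because the comparison counts multiplicity, a dropped paragraph shows up as many missing trigram occurrences even when the individual trigrams appear elsewhere on the page.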

By @jerrygenser - 9 months
I would categorize Azure Document AI's accuracy as high, not "mid", including handwriting. However, the $1.50/1,000-pages tier doesn't include layout detection.

The $10/1000 pages model includes layout detection (headers, etc.) as well as key-value pairs and checkbox detection.

I have continued to do proofs of concept with Gemini and GPT, and in general with any new multimodal model that comes out, but so far none is on par with Azure's checkbox detection.

In fact the results from Gemini/GPT4 aren't even good enough to use as a teacher for distillation of a "small" multimodal model specializing in layout/checkbox.

I would also like to shout out Surya OCR, which is up-and-coming. It's source-available and free below a certain funding or revenue milestone (I think $5M). It doesn't have word-level detection yet, but it's one of the more promising OCR tools I'm aware of outside the hyperscalers and heavy commercial offerings.

By @ndr_ - 9 months
Prompts in the background:

  const systemPrompt = `
    Convert the following PDF page to markdown. 
    Return only the markdown with no explanation text. 
    Do not exclude any content from the page.
  `;
For each subsequent page, a system message keeps the formatting consistent with the prior page:

    messages.push({
      role: "system",
      content: `Markdown must maintain consistent formatting with the following page: \n\n"""${priorPage}"""`,
    });

Could be handy for general-purpose frontend tools.

By @beklein - 9 months
Very interesting project, thank you for sharing.

Are you supporting the Batch API from OpenAI? This would lower costs by 50%. Many OCR tasks are not time-sensitive, so this might be a very good tradeoff.
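For context, the Batch API consumes a JSONL file with one request per line; the per-page entries for a job like this could be built as follows (a sketch under my own assumptions about model and prompt, and zerox itself may not expose this):

```python
import json

def batch_request(page_id: str, image_url: str) -> dict:
    """One line of an OpenAI Batch API input file (JSONL)."""
    return {
        "custom_id": page_id,          # echoed back in the results file
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o-mini",
            "messages": [
                {"role": "system", "content": "Convert the following PDF page to markdown."},
                {"role": "user", "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                ]},
            ],
        },
    }

pages = [f"page-{i}" for i in range(1, 4)]
jsonl = "\n".join(json.dumps(batch_request(p, f"https://example.com/{p}.png")) for p in pages)
```

The file is then uploaded and referenced by a batch-creation call with a 24-hour completion window; results come back in an output file keyed by `custom_id`, which is why OCR jobs that aren't time-sensitive are a natural fit.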

By @surfingdino - 9 months
Xerox tried it a while ago. It didn't end well: https://www.dkriesel.com/en/blog/2013/0802_xerox-workcentres...

By @bearjaws - 9 months
I did this for images using Tesseract for OCR + Ollama for AI.

Check it out, https://cluttr.ai

Runs entirely in browser, using OPFS + WASM.

By @constantinum - 9 months
If you want to do document OCR/PDF text extraction with decent accuracy without using an LLM, do give LLMWhisperer[1] a try.

Try with any PDF document in the playground - https://pg.llmwhisperer.unstract.com/

[1] - https://unstract.com/llmwhisperer/

By @binalpatel - 9 months
You can do some really cool things now with these models, like asking them to extract not just the text but figures/graphs as nodes/edges, and it works very well. Back when GPT-4 with Vision came out, I tried this with a simple prompt plus dumping in a Pydantic schema of what I wanted, and it was spot on. Pretty much this (before JSON mode was supported):

    You are an expert in PDFs. You are helping a user extract text from a PDF.

    Extract the text from the image as a structured json output.

    Extract the data using the following schema:

    {Page.model_json_schema()}

    Example:
    {{
      "title": "Title",
      "page_number": 1,
      "sections": [
        ...
      ],
      "figures": [
        ...
      ]
    }}

https://binal.pub/2023/12/structured-ocr-with-gpt-vision/

By @amluto - 9 months
My intuition is that the best solution here would be a division of labor: have the big multimodal model identify tables, paragraphs, etc., and output a mapping between segments of the document and textual output. Then a much simpler model that doesn't try to hold entire conversations can process those segments into their contents.

This will perform worse in cases where whatever understanding the large model has of the contents is needed to recognize indistinct symbols. But it will avoid cases where that very same understanding causes contents to be understood incorrectly due to the model’s assumptions of what the contents should be.

At least in my limited experiments with Claude, it’s easy for models to lose track of where they’re looking on the page and to omit things entirely. But if segmentation of the page is explicit, one can enforce that all contents end up in exactly one segment.
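That last invariant is mechanically checkable. A toy version for vertical page segments (my own illustration, not any real pipeline's code):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Segment:
    top: int     # inclusive y-coordinate
    bottom: int  # exclusive y-coordinate

def covers_exactly_once(segments: list, page_height: int) -> bool:
    """True if the segments tile the page: no gaps, no overlaps."""
    spans = sorted((s.top, s.bottom) for s in segments)
    cursor = 0
    for top, bottom in spans:
        if top != cursor or bottom <= top:
            return False  # gap, overlap, or empty segment
        cursor = bottom
    return cursor == page_height
```

With explicit segmentation, a page where the model silently skipped a region fails this check instead of silently losing content.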

By @aman2k4 - 9 months
I am using AWS Textract + LLM (OpenAI/Claude) to read grocery receipts for <https://www.5outapp.com>

So far, I have collected over 500 receipts from around 10 countries with 30 different supermarkets in 5 different languages.

What has worked for me so far is having control over OCR and processing (for formatting/structuring) separately. I don't have the figures to provide a cost structure, but I'm looking for other solutions to improve both speed and accuracy. Also, I need to figure out a way to put a metric around accuracy. I will definitely give this a shot. Thanks a lot.

By @refulgentis - 9 months
FWIW, I have it on good sourcing that OpenAI supplies Tesseract output to the LLM, so you're in a great place, best of all worlds.

By @lootsauce - 9 months
In my own experiments I have had major failures where much of the text is fabricated by the LLM, to the point where I find it hard to trust even with great prompt engineering. What has really impressed me is its ability to take medium-quality OCR from Acrobat, with poor formatting, lots of errors, and punctuation problems, and render 100% accurate, properly formatted output simply by being asked to correct the OCR output. This approach, using traditional cheap OCR for grounding, might be a really robust and cheap option.
By @jimmyechan - 9 months
Congrats! Cool project! I’d been curious about whether GPT would be good for this task. Looks like this answers it!

Why did you choose markdown? Did you try other output formats and see if you get better results?

Also, I wonder how HTML performs. It would be a way to handle tables with groupings/merged cells.

By @josefritzishere - 9 months
Xerox might want to have a word with you about that name.

By @ReD_CoDE - 9 months
It seems there's a need for a benchmark comparing all the solutions available in the market on quality and price.

The majority of comments are about prices and quality.

Also, are there any movements toward product detection? These days I'm looking for solutions that can recognize goods with high accuracy and output [brand][product_name][variant].

By @samuell - 9 months
One problem I've not found any OCR solution to handle well is complex column-based layouts in magazines. Part of the problem is that images often span anything from one to all columns, so the text can flow in funny ways. But in this day and age, this must be possible for the best AI-based tools to handle?
By @jagermo - 9 months
Ohh, that could finally be a great way to get my TTRPG books readable on Kindle. I'll give it a try, thanks for that.
By @8organicbits - 9 months
> And 6 months from now it'll be fast, cheap, and probably more reliable!

I like the optimism.

I've needed to include human review with previous-generation OCR software when I needed the results to be accurate. It's painstaking, but the OCR offered a speedup over fully manual transcription. Have you given any thought to human-in-the-loop processes?

By @downrightmike - 9 months
Does it also produce a confidence number?

By @Dkuku - 9 months
Check gpt-4o; gpt-4o-mini uses around 20 times more tokens for the same image: https://youtu.be/ZWxBHTgIRa0?si=yjPB1FArs2DS_Rc9&t=655
By @ravetcofx - 9 months
I'd be more curious to see the performance over local models like LLaVa etc.
By @ipkstef - 9 months
I think I'm missing something... why would I pay to OCR the images when I can do it locally for free? Tesseract runs pretty well on just a CPU; I wouldn't even need something crazy powerful.
By @cmpaul - 9 months
Great example of how LLMs are eliminating/simplifying giant swathes of complex tech.

I would love to use this in a project if it could also caption embedded images to produce something for RAG...

By @throwthrowuknow - 9 months
Have you compared the results to special-purpose OCR-free models that do image-to-text with layout? My intuition is mini should be just as good, if not better.
By @jdthedisciple - 9 months
Very nice, seems to work pretty well!

Just

    maintainFormat: true
did not seem to have any effect in my testing.
By @fudged71 - 9 months
Llama 3.1 now has image support, right? Could this be adapted for it as well, maybe with Groq for speed?
By @daft_pink - 9 months
I would really love something like this that could be run locally.
By @murmansk - 9 months
Man, this is just an awesome hack! Keep it up!