Simon Willison’s Weblog: olmOCR

Feb 26, 2025

—

Source URL: https://simonwillison.net/2025/Feb/26/olmocr/#atom-everything
Source: Simon Willison’s Weblog
Title: olmOCR

Feedly Summary: olmOCR
New from Ai2 – olmOCR is “an open-source tool designed for high-throughput conversion of PDFs and other documents into plain text while preserving natural reading order”.
At its core is allenai/olmOCR-7B-0225-preview, a Qwen2-VL-7B-Instruct variant trained on ~250,000 pages of diverse PDF content (both scanned and text-based) that were labelled using GPT-4o and made available as the olmOCR-mix-0225 dataset.
The olmocr Python library can run the model on any "recent NVIDIA GPU". I haven’t managed to run it on my own Mac yet – there are GGUFs out there but it’s not clear to me how to run vision prompts through them – but Ai2 offer an online demo which can handle up to ten pages for free.
Given the right hardware this looks like a very inexpensive way to run large scale document conversion projects:

“We carefully optimized our inference pipeline for large-scale batch processing using SGLang, enabling olmOCR to convert one million PDF pages for just $190 – about 1/32nd the cost of using GPT-4o APIs.”
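
To make the quoted figures concrete, here is a minimal Python sketch of the arithmetic. The only inputs taken from the post are $190 per million pages and the "about 1/32nd" ratio. The function names are hypothetical, the linear scaling with page count is an assumption (real costs depend on GPU pricing and throughput), and the GPT-4o figure is only what the quoted ratio implies, not a published price.

```python
# Figures quoted in the post.
OLMOCR_USD_PER_MILLION_PAGES = 190.0
GPT4O_COST_MULTIPLE = 32  # olmOCR is "about 1/32nd the cost", so this is approximate


def olmocr_cost_usd(pages: int) -> float:
    """Estimated olmOCR cost, assuming cost scales linearly with page count."""
    return pages * OLMOCR_USD_PER_MILLION_PAGES / 1_000_000


def implied_gpt4o_cost_usd(pages: int) -> float:
    """GPT-4o cost implied by the quoted 1/32 ratio (an estimate, not a price list)."""
    return olmocr_cost_usd(pages) * GPT4O_COST_MULTIPLE


if __name__ == "__main__":
    for pages in (10_000, 250_000, 1_000_000):
        print(
            f"{pages:>9,} pages: olmOCR ~${olmocr_cost_usd(pages):,.2f}, "
            f"GPT-4o ~${implied_gpt4o_cost_usd(pages):,.2f}"
        )
```

Under these assumptions the per-page cost works out to roughly $0.00019, and the implied GPT-4o cost for one million pages is on the order of $6,000.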

Via Luca Soldaini
Tags: vision-llms, ai, qwen, llms, fine-tuning, pdf, generative-ai, ocr, ai2

AI Summary and Description: Yes

Summary: The text introduces olmOCR, an open-source tool from Ai2 for high-throughput conversion of PDFs and other documents into plain text. It covers the underlying model, its hardware requirements, and its low cost per page, which makes it relevant to professionals in AI and cloud computing security.

Detailed Description:

– **olmOCR Overview**: An open-source tool by Ai2 designed for high-throughput conversion of PDFs and other documents into plain text while preserving the natural reading order. This capability matters for organizations that need to digitize documents for easier access and processing.

– **Core Technology**: Built on the allenai/olmOCR-7B-0225-preview model, a variant of Qwen2-VL-7B-Instruct trained on approximately 250,000 pages of diverse PDF content, both scanned and text-based.

– **Dataset and Training**: The training pages were labelled using GPT-4o and have been released as the olmOCR-mix-0225 dataset.

– **Hardware Requirements**: The olmocr Python library can run the model on any "recent NVIDIA GPU". The author has not yet managed to run it on a Mac: GGUF builds exist, but it is not clear how to run vision prompts through them, which may point to cross-platform limitations.

– **Cost Efficiency**: Ai2 optimized the inference pipeline for large-scale batch processing using SGLang. As a result, olmOCR can convert one million PDF pages for $190, which is about 1/32nd the cost of doing the same work with the GPT-4o APIs.

– **Accessibility**: Ai2 offers an online demo that allows users to test the tool with up to ten pages free of charge, facilitating wider adoption and experimentation among users before committing to more extensive projects.

– **Implications for Security Professionals**:
  – As organizations manage growing volumes of data locked in PDFs, tools like olmOCR can assist with digitization, which in turn raises data security and compliance considerations.
  – Understanding how such AI models are used can lead to better practices for securing sensitive data as it is processed and converted.
  – Integrating such tools into cloud computing infrastructure may require further examination of security implications, data governance, and compliance with data-handling regulations.

Overall, olmOCR stands out as a significant development in AI-based document processing, relevant both as an AI advance and for its implications for information security and compliance practices.