Running RolmOCR on Vast.ai
RolmOCR from Reducto is an open-source document OCR model built on Qwen2.5-VL-7B. Unlike traditional OCR tools like Tesseract that rely on character-level pattern matching, RolmOCR uses vision-language understanding to interpret documents semantically. This means it handles rotated text, complex layouts, and mixed content without manual preprocessing, and can extract structured data directly rather than just raw text. This guide demonstrates extracting structured pricing data from invoice images.
Prerequisites
- Vast.ai account with credits
- Vast.ai CLI installed (pip install vastai)
Hardware Requirements
- GPU RAM: 16GB minimum, 60GB recommended for larger documents and wider context
- Disk: 80GB
- Static IP: Required for stable endpoint
Setting Up the CLI
Authenticate the CLI with the API key from your Vast.ai account using vastai set api-key.
Finding an Instance
Use vastai search offers to find machines that meet the hardware requirements above, and note the ID of the offer you want to rent.
Deploying the Instance
First, generate a secure API key to protect your endpoint:
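A minimal sketch using Python's standard secrets module; any sufficiently long random string works, and the variable name api_key is just illustrative.

```python
import secrets

# Generate a 64-character hex string to use as the endpoint's API key.
# Keep this value: it is supplied to the vLLM server at deployment time
# and reused by the client later in this guide.
api_key = secrets.token_hex(32)
print(api_key)
```

The generated key is passed to the vLLM server when you create the instance (for example via vastai create instance) and to the OpenAI client in the sections below.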
Installing Client Dependencies
The client-side code in this guide uses the openai and datasets packages, Pillow for image handling, and pydantic; install them with pip before continuing.
Loading Sample Data
For this example, we’ll use a public invoice dataset from Hugging Face. Streaming mode avoids downloading the entire dataset:
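A sketch of the loading step: the dataset id below is a placeholder rather than a specific dataset pinned by this guide, and each record is assumed to expose the scanned invoice as a PIL image under an "image" field.

```python
from itertools import islice

from datasets import load_dataset

# Placeholder dataset id; substitute the public invoice dataset you want
# to pull from the Hugging Face Hub.
dataset = load_dataset("your-org/invoice-dataset", split="train", streaming=True)

# Streaming returns an iterable, so only the records we actually touch
# are downloaded. Grab a handful of samples to work with.
samples = list(islice(dataset, 5))

# Assumed layout: each record carries the scanned invoice as a PIL image
# under the "image" key.
first_image = samples[0]["image"]
```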
Encoding Images
The API expects base64-encoded images. This helper function also resizes images to reduce payload size and improve processing speed:
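A sketch of such a helper using Pillow (an assumption; any imaging library works). The 1024-pixel cap and JPEG quality are arbitrary defaults, not values prescribed by the model.

```python
import base64
import io

from PIL import Image


def encode_image(image: Image.Image, max_dim: int = 1024) -> str:
    """Downscale an image and return it as a base64-encoded JPEG string."""
    image = image.convert("RGB")         # JPEG requires an RGB image
    image.thumbnail((max_dim, max_dim))  # in-place resize, keeps aspect ratio
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG", quality=90)
    return base64.b64encode(buffer.getvalue()).decode("utf-8")
```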
Defining the Output Schema
Define a Pydantic model for the expected output. vLLM uses this schema with guided decoding, constraining the model’s token generation so it can only produce valid JSON matching your structure. This eliminates malformed responses and post-processing validation, which is critical for OCR pipelines where you need reliable field extraction:
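A sketch of one possible schema; the field names are illustrative and should be adjusted to whatever you need from your invoices.

```python
from pydantic import BaseModel


class LineItem(BaseModel):
    description: str
    quantity: float
    unit_price: float
    line_total: float


class Invoice(BaseModel):
    invoice_number: str
    vendor_name: str
    invoice_date: str
    currency: str
    line_items: list[LineItem]
    total_amount: float
```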
Calling the API
Create the OpenAI client and a function to extract invoice data. The guided_json parameter passes your schema to vLLM’s constrained decoding engine, guaranteeing the response will be parseable JSON with exactly the fields you defined:
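A sketch under a few assumptions: the OpenAI Python client forwards guided_json to vLLM through extra_body, INSTANCE_IP and PORT stand in for your instance's public address and mapped port, API_KEY is the key generated earlier, the model is served under the name reducto/RolmOCR (use whatever name your deployment registered), and Invoice and encode_image come from the earlier sketches.

```python
from openai import OpenAI

# Point the client at the vLLM server running on your instance.
client = OpenAI(
    base_url="http://INSTANCE_IP:PORT/v1",  # placeholder address
    api_key="API_KEY",                      # key generated earlier
)


def extract_invoice(image_b64: str) -> Invoice:
    """Send one base64-encoded invoice image and parse the structured reply."""
    response = client.chat.completions.create(
        model="reducto/RolmOCR",  # name the model was served under
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"},
                    },
                    {
                        "type": "text",
                        "text": "Extract the invoice fields as JSON matching the schema.",
                    },
                ],
            }
        ],
        # vLLM extension: constrain decoding to the Invoice schema.
        extra_body={"guided_json": Invoice.model_json_schema()},
        temperature=0.0,
        max_tokens=1024,
    )
    return Invoice.model_validate_json(response.choices[0].message.content)
```

Tying the pieces together, extract_invoice(encode_image(samples[0]["image"])) returns a validated Invoice object for the first sample.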