Tencent Releases HunyuanOCR: A 1-Billion-Parameter Model That Redefines Optical Character Recognition
Tencent’s Hunyuan Vision Team has introduced HunyuanOCR, a new open-source Vision-Language Model (VLM) that achieves commercial-grade performance in Optical Character Recognition (OCR) tasks while remaining exceptionally lightweight. At just 1 billion parameters, this model challenges the notion that state-of-the-art accuracy requires massive computational scale, consistently outperforming traditional pipeline-based systems and many larger, general-purpose VLMs.
For decades, OCR systems relied on cascaded pipelines—a multi-step process involving separate modules for text detection, layout analysis, and recognition. This architecture is notorious for “error propagation,” where a small mistake in an early stage (like detecting a text block) severely compromises the final output’s accuracy and structural integrity.
HunyuanOCR radically simplifies this process by adopting a unified, end-to-end VLM architecture. It connects a native-resolution Vision Transformer (ViT) with a lightweight large language model (LLM) via an adaptive connector. This design allows the entire workflow—from spotting text to complex document parsing—to be completed in a single, streamlined inference pass, eliminating error accumulation and simplifying deployment.
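The single-pass design can be sketched in miniature. The stubs below are purely illustrative (the function names, token shapes, and connector behavior are assumptions, not HunyuanOCR's actual implementation); the point is that one composed call replaces separate detection, layout, and recognition stages.

```python
def vision_encoder(image_pixels):
    # Stub for the native-resolution ViT: maps raw pixels to visual tokens.
    # (The real model patch-embeds at the image's own resolution.)
    return [f"vtok{i}" for i, _ in enumerate(image_pixels)]

def adaptive_connector(visual_tokens):
    # Stub for the adaptive connector: projects/compresses visual tokens
    # into the LLM's embedding space. Token reduction here is illustrative.
    return visual_tokens[::2]

def language_model(tokens, prompt):
    # Stub decoder: conditions on the visual tokens plus an instruction prompt
    # and emits the final structured text.
    return f"{prompt}: decoded {len(tokens)} visual tokens"

def ocr_end_to_end(image_pixels, prompt="Parse document"):
    # One streamlined inference pass: errors cannot propagate between
    # stages because there are no separate stages to hand off between.
    return language_model(adaptive_connector(vision_encoder(image_pixels)), prompt)
```

Because the whole workflow is one differentiable pass, a mistake in "detection" is never frozen into the input of a downstream recognizer, which is the failure mode of cascaded pipelines.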
The model’s efficiency is matched by its versatility, covering the full spectrum of modern OCR needs: text spotting, end-to-end document parsing, information extraction (IE), visual question answering (VQA), and multilingual translation.
To illustrate its precision, consider document parsing: HunyuanOCR can ingest a complex, multi-column technical paper and output the entire content in Markdown, automatically converting tables into structured HTML code and mathematical equations into LaTeX format. This capability makes it a powerful tool for retrieval-augmented generation (RAG) systems that depend on clean, structured data.
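For a RAG pipeline, that mixed Markdown output is easy to post-process. The snippet below is a minimal sketch (the sample string and regexes are assumptions about the output shape, not HunyuanOCR's documented format): it pulls the embedded HTML tables and LaTeX equations out of a parsed page so they can be indexed separately from body text.

```python
import re

# Hypothetical parser output: Markdown body with an inline HTML table
# and a display-math LaTeX equation, as described in the article.
SAMPLE = """# Results
Body text before the table.
<table><tr><td>Model</td><td>Score</td></tr></table>
The loss is $$L = -\\sum_i y_i \\log p_i$$ as defined above.
"""

def extract_structured(markdown: str):
    """Separate HTML tables and LaTeX equations from parsed Markdown."""
    tables = re.findall(r"<table>.*?</table>", markdown, flags=re.S)
    equations = re.findall(r"\$\$(.+?)\$\$", markdown, flags=re.S)
    return tables, equations
```

Keeping tables as HTML and math as LaTeX means downstream chunkers can treat each modality with an appropriate strategy instead of flattening everything to plain text.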
In competitive evaluations, HunyuanOCR secured first place in the ICDAR 2025 Document Image Machine Translation (DIMT) Small Model Track and achieved state-of-the-art results on OCRBench among VLMs under 3 billion parameters. For instance, on the comprehensive OmniDocBench document parsing benchmark, it scored 94.10, surpassing major commercial alternatives. Its robust spotting capability also excels in challenging real-world scenarios, reliably detecting text in artistic fonts and low-quality, densely packed documents.
The performance edge is attributed to a unique training recipe, which leverages a four-stage pre-training strategy combined with targeted Reinforcement Learning (RL). This innovative RL approach uses verifiable, fine-grained rewards—such as assessing the accuracy of both the bounding box localization and the recognized text simultaneously—to ensure the model learns to produce high-quality, structured, and accurate outputs across tasks.
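A verifiable spotting reward of this kind can be made concrete. The following is an illustrative sketch, not the paper's actual reward function: it blends bounding-box IoU with a text-similarity score (here `difflib`'s ratio; the weighting `alpha` is an assumption), so the model is rewarded only when localization and recognition are both right.

```python
from difflib import SequenceMatcher

def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def spotting_reward(pred_box, gt_box, pred_text, gt_text, alpha=0.5):
    """Illustrative verifiable reward: localization accuracy (IoU)
    combined with recognized-text similarity, weighted by alpha."""
    text_sim = SequenceMatcher(None, pred_text, gt_text).ratio()
    return alpha * box_iou(pred_box, gt_box) + (1 - alpha) * text_sim
```

Because both terms are computed against ground truth, the reward is cheap to verify and hard to game, which is what makes it usable as a fine-grained RL signal.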
As a compact, open-source, and highly efficient solution, HunyuanOCR offers a robust foundation for industrial applications, potentially democratizing advanced OCR capabilities for resource-constrained environments and edge devices.