Qianfan-OCR is a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding by directly converting images to Markdown. To compensate for the loss of explicit layout analysis in end-to-end OCR, the authors introduce "Layout-as-Thought," a mechanism where special tokens trigger the generation of structured layout representations before producing final outputs. Qianfan-OCR achieves state-of-the-art performance on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8), and surpasses Gemini-3.1-Pro, Seed-2.0, and Qwen3-VL-235B on key information extraction benchmarks.
We present Qianfan-OCR, a 4B-parameter end-to-end vision-language model that unifies document parsing, layout analysis, and document understanding within a single architecture. It performs direct image-to-Markdown conversion and supports diverse prompt-driven tasks including table extraction, chart understanding, document QA, and key information extraction. To address the loss of explicit layout analysis in end-to-end OCR, we propose Layout-as-Thought, an optional thinking phase triggered by special think tokens that generates structured layout representations -- bounding boxes, element types, and reading order -- before producing final outputs, recovering layout grounding capabilities while improving accuracy on complex layouts. Qianfan-OCR ranks first among end-to-end models on OmniDocBench v1.5 (93.12) and OlmOCR Bench (79.8), achieves competitive results on OCRBench, CCOCR, DocVQA, and ChartQA against general VLMs of comparable scale, and attains the highest average score on public key information extraction benchmarks, surpassing Gemini-3.1-Pro, Seed-2.0, and Qwen3-VL-235B. The model is publicly accessible via the Baidu AI Cloud Qianfan platform.
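The Layout-as-Thought phase described in the abstract can be pictured as a two-part model output: an optional structured layout block wrapped in special think tokens, followed by the final Markdown. The sketch below is a minimal illustration of consuming such an output; the token names (`<|layout_think|>`), the JSON schema, and the sample document are assumptions for illustration, not the paper's actual format.

```python
import json
import re

# Hypothetical model output: a "thinking" block holding the structured layout
# representation (bounding boxes, element types, reading order), then the
# final Markdown. Token names and schema are illustrative assumptions.
SAMPLE_OUTPUT = """<|layout_think|>
[
  {"bbox": [40, 32, 560, 70], "type": "title", "order": 0},
  {"bbox": [40, 90, 560, 300], "type": "paragraph", "order": 1},
  {"bbox": [40, 320, 560, 520], "type": "table", "order": 2}
]
<|/layout_think|>
# Quarterly Report

Revenue grew 12% year over year.

| Quarter | Revenue |
|---------|---------|
| Q1      | 1.2M    |
"""

def split_layout_and_markdown(text: str):
    """Separate the optional layout 'thinking' block from the final Markdown."""
    match = re.search(r"<\|layout_think\|>(.*?)<\|/layout_think\|>", text, re.DOTALL)
    if match is None:
        # No thinking phase triggered: plain end-to-end image-to-Markdown.
        return [], text.strip()
    layout = json.loads(match.group(1))  # bbox, element type, reading order
    markdown = text[match.end():].strip()
    return layout, markdown

layout, markdown = split_layout_and_markdown(SAMPLE_OUTPUT)
# Elements sorted by the model's predicted reading order.
reading_order = [e["type"] for e in sorted(layout, key=lambda e: e["order"])]
```

Keeping the layout block separate from the Markdown is what lets the thinking phase remain optional: a consumer that only wants the converted document can discard everything before the closing think token.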