DAMONortheasternMay 5, 2026arXiv:2605.03903

CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing

Zhipeng Xu, Junhao Ji, Zulong Chen, Zhenghao Liu, Qing Liu, Chunyi Peng, Z. Qin, Zebo Xu, Jianqiang Wan, Jun Tang, Zhibo Yang, Shuai Bai, Dayiheng Liu

AI Summary

The authors introduce CC-OCR V2, a new OCR benchmark designed to evaluate Large Multimodal Models (LMMs) on real-world document processing tasks, encompassing text recognition, document parsing, grounding, key information extraction, and question answering. Experiments on 14 LMMs reveal a significant performance gap compared to existing benchmarks, highlighting their limitations in handling the complexities of practical enterprise document processing. The benchmark includes 7,093 high-difficulty samples and covers corner cases often missed by other datasets.

Key Contribution

Despite impressive OCR performance on existing benchmarks, today's best LMMs still struggle with the messy realities of enterprise document processing.

Abstract

Large Multimodal Models (LMMs) have recently shown strong performance on Optical Character Recognition (OCR) tasks, demonstrating their promising capability in document literacy. However, their effectiveness in real-world applications remains underexplored, as existing benchmarks adopt task scopes misaligned with practical applications and assume homogeneous acquisition conditions. To address this gap, we introduce CC-OCR V2, a comprehensive and challenging OCR benchmark tailored to real-world document processing. CC-OCR V2 focuses on practical enterprise document processing tasks and incorporates hard and corner cases that are critical yet underrepresented in prior benchmarks, covering 5 major OCR-centric tracks: text recognition, document parsing, document grounding, key information extraction, and document question answering, comprising 7,093 high-difficulty samples. Extensive experiments on 14 advanced LMMs reveal that current models fall short of real-world application requirements. Even state-of-the-art LMMs exhibit substantial performance degradation across diverse tasks and scenarios. These findings reveal a significant gap between performance on current benchmarks and effectiveness in real-world applications. We release the full dataset and evaluation toolkit at https://github.com/eioss/CC-OCR-V2.

Computer Vision Eval Frameworks & Benchmarks Multimodal Models

Citation Metrics

Citations0

Influential citations0

References45

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

CC-OCR V2: Benchmarking Large Multimodal Models for Literacy in Real-world Document Processing

Related Papers