The paper introduces Real5-OmniDocBench, a new benchmark that physically reconstructs the entire OmniDocBench v1.5 dataset across five real-world scenarios to evaluate VLMs' robustness in document parsing. This benchmark provides a complete ground-truth mapping, enabling factor-wise attribution of performance degradation due to geometric distortions, optical artifacts, or model limitations. Experiments using this benchmark reveal a significant "reality gap" in document parsing, highlighting the need for more resilient document intelligence models.
VLMs that ace digital document parsing benchmarks still stumble badly when faced with real-world scanned, warped, or photographed documents, revealing a significant "reality gap."
While Vision-Language Models (VLMs) achieve near-perfect scores on digital document benchmarks like OmniDocBench, their performance in the unpredictable physical world remains largely unknown due to the lack of controlled yet realistic evaluations. We introduce Real5-OmniDocBench, the first benchmark that performs a full-scale, one-to-one physical reconstruction of the entire OmniDocBench v1.5 (1,355 images) across five critical real-world scenarios: Scanning, Warping, Screen-Photography, Illumination, and Skew. Unlike prior benchmarks that either lack digital correspondence or employ partial sampling, our complete ground-truth mapping enables, for the first time, rigorous factor-wise attribution of performance degradation, allowing us to pinpoint whether failures stem from geometric distortions, optical artifacts, or model limitations. Our benchmark establishes a challenging new standard for the community, demonstrating that the 'reality gap' in document parsing is far from closed, and provides a diagnostic tool to guide the development of truly resilient document intelligence.
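To make the idea of factor-wise attribution concrete, the following is a minimal sketch of how a per-scenario "reality gap" could be computed once every physical capture maps back to its digital original. It is not the paper's evaluation code: the function name `reality_gap`, the dictionary layout, and the use of a single scalar score per page are all assumptions made for illustration.

```python
from statistics import mean


def reality_gap(digital_scores, real_scores):
    """Mean per-page score drop for each scenario (hypothetical helper).

    digital_scores: {page_id: score} measured on the digital originals.
    real_scores:    {scenario: {page_id: score}} measured on the physical
                    reconstructions (e.g. "scanning", "warping", "skew").
    Scores are assumed to be higher-is-better scalars.
    """
    gaps = {}
    for scenario, scores in real_scores.items():
        # The one-to-one mapping lets us pair each capture with its original;
        # pages missing from either side are skipped.
        shared = sorted(set(scores) & set(digital_scores))
        if not shared:
            continue
        gaps[scenario] = mean(digital_scores[p] - scores[p] for p in shared)
    return gaps
```

Because each scenario is compared against the same set of digital pages, a larger gap for one scenario than another can be read as degradation attributable to that capture condition rather than to differences in page content.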