This paper introduces PreciseDoc, a Large Multimodal Model (LMM) designed to improve the precision of visual grounding in text-rich document images, where existing models perform poorly. Its training combines large-scale synthetic document generation with reinforcement learning that jointly supervises grounding and reasoning, improving the localization of the document elements needed for accurate reasoning. Evaluations on various benchmarks show gains in both document spatial grounding and document understanding, and the model handles practical tasks beyond single-text localization, such as locating personal information in CVs.
PreciseDoc targets precise grounding of critical document elements, improving how LMMs interpret complex, text-rich documents.
Visual grounding in documents is a crucial ability for Large Multimodal Models (LMMs) in areas such as document understanding, deep research, and document error detection. However, existing approaches exhibit poor grounding precision in text-rich document images, often failing to accurately locate the critical document elements needed for reliable reasoning. To address this gap, we introduce PreciseDoc, an LMM specifically designed for precise element grounding that can be further optimized for Document VQA tasks. Specifically, to enhance basic localization capability, we construct challenging training data using two pipelines capable of mass-producing high-quality documents paired with fine-grained coordinate metadata, including synthetic hand-filled documents with camera effects. The model also develops practical capabilities beyond straightforward localization of a single text span, such as locating personal information in CVs. Furthermore, we introduce a training paradigm for visually grounded reasoning in which grounding and reasoning are supervised jointly with reinforcement learning to increase the contribution of the grounded evidence. A comprehensive evaluation on various benchmarks demonstrates the advantage of the proposed data and methods in document spatial grounding and document understanding.
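To make the idea of jointly supervising grounding and reasoning more concrete, the sketch below shows one way a single scalar reward could combine localization quality with answer correctness. It is an illustration under stated assumptions, not the reward used by PreciseDoc: the abstract does not specify the reward, so the IoU grounding term, the exact-match answer term, the `joint_reward` function, and the `grounding_weight` parameter are all hypothetical.

```python
from typing import Tuple

# Hypothetical box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero so disjoint boxes yield no intersection area.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def joint_reward(
    pred_box: Box,
    gold_box: Box,
    pred_answer: str,
    gold_answer: str,
    grounding_weight: float = 0.5,  # assumed trade-off, not from the paper
) -> float:
    """Weighted sum of a grounding term (IoU) and an answer term (exact match)."""
    answer_score = (
        1.0 if pred_answer.strip().lower() == gold_answer.strip().lower() else 0.0
    )
    grounding_score = iou(pred_box, gold_box)
    return grounding_weight * grounding_score + (1.0 - grounding_weight) * answer_score


if __name__ == "__main__":
    # Correct answer, but the predicted box covers only half of the gold region.
    print(joint_reward((0, 0, 10, 10), (5, 0, 15, 10), "Yes", "yes"))
```

Under a reward of this shape, a response with the right answer but a poorly placed box scores lower than one that also localizes the supporting element, which is one way to encourage the answer to depend on the grounded evidence.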