Tsinghua AICAS | Jun 23, 2026 | arXiv:2606.24118

An LMM for Precisely Grounding Elements in Documents

Yijian LU, Chuangxin Zhao, Kai Sun, Lei Hou, Juanzi Li, Ji Qi

AI Summary

This paper introduces PreciseDoc, a Large Multimodal Model (LMM) designed to improve the precision of visual grounding in text-rich document images, where existing models often fail to locate the elements needed for reliable reasoning. Its training combines large-scale synthetic document generation, including fine-grained coordinate metadata, with a reinforcement learning paradigm that supervises grounding and reasoning jointly. According to the authors, evaluations on various benchmarks show advantages in both document spatial grounding and document understanding, and the model extends beyond single-text localization to tasks such as locating personal information in CVs.

Key Contribution

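PreciseDoc targets precise grounding of critical document elements in text-rich images, using synthetic training documents with fine-grained coordinate metadata and a training paradigm that jointly supervises grounding and reasoning with reinforcement learning.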

Abstract

Visual grounding in documents is a crucial ability for Large Multimodal Models (LMMs) in areas such as document understanding, deep research, and document error detection. However, existing approaches exhibit poor grounding precision in text-rich document images, often failing to accurately locate the critical document elements needed for reliable reasoning. To address this gap, we introduce PreciseDoc, an LMM that is specifically designed for precise element grounding and can be further optimized for Document VQA tasks. Specifically, to enhance the basic localization capability, we construct challenging training data using two pipelines capable of mass-producing high-quality documents with paired metadata of fine-grained coordinates, including synthetic hand-filled documents with camera effects. The model also develops real-world functions beyond straightforward localization of a single piece of text, such as locating personal information in CVs. Furthermore, we introduce a training paradigm for visually grounded reasoning in which grounding and reasoning are supervised jointly with reinforcement learning to improve the contribution of the grounded evidence. A comprehensive evaluation on various benchmarks demonstrates the advantage of the proposed data and methods in document spatial grounding and document understanding.
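The abstract does not state how grounding and reasoning are scored during reinforcement learning. As a purely illustrative sketch, and not the authors' method, the snippet below assumes boxes in (x1, y1, x2, y2) pixel coordinates, uses intersection-over-union (a common grounding measure) for the localization term, and uses exact-match for the answer term. The names `iou`, `joint_reward`, and `grounding_weight` are hypothetical.

```python
from typing import Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def iou(pred: Box, gold: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(pred[0], gold[0]), max(pred[1], gold[1])
    ix2, iy2 = min(pred[2], gold[2]), min(pred[3], gold[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_pred = max(0.0, pred[2] - pred[0]) * max(0.0, pred[3] - pred[1])
    area_gold = max(0.0, gold[2] - gold[0]) * max(0.0, gold[3] - gold[1])
    union = area_pred + area_gold - inter
    return inter / union if union > 0 else 0.0


def joint_reward(
    pred_box: Box,
    gold_box: Box,
    pred_answer: str,
    gold_answer: str,
    grounding_weight: float = 0.5,  # hypothetical weighting, not from the paper
) -> float:
    """Weighted sum of a grounding term (IoU) and an answer term (exact match)."""
    answer_score = (
        1.0 if pred_answer.strip().lower() == gold_answer.strip().lower() else 0.0
    )
    grounding_score = iou(pred_box, gold_box)
    return grounding_weight * grounding_score + (1.0 - grounding_weight) * answer_score
```

In this sketch a response with a correct answer but a poorly localized box earns only part of the reward, which is one simple way to make the grounded evidence count toward the training signal.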

Computer Vision Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
