The paper introduces MMTR-Bench, a benchmark for evaluating how well Multimodal Large Language Models (MLLMs) reconstruct masked text directly from visual context, without explicit prompts. By requiring models to recover masked text from single- or multi-page documents and webpages, the benchmark isolates layout understanding, visual grounding, and knowledge integration. Experiments on representative MLLMs show that MMTR-Bench poses a significant challenge, particularly for sentence- and paragraph-level reconstruction across multiple languages.
MLLMs struggle to "read" missing text directly from visual context, even when they possess the necessary visual grounding and layout understanding.
We introduce MMTR-Bench, a benchmark designed to evaluate the intrinsic ability of Multimodal Large Language Models (MLLMs) to reconstruct masked text directly from visual context. Unlike conventional question-answering tasks, MMTR-Bench eliminates explicit prompts, requiring models to recover masked text from single- or multi-page inputs across real-world domains such as documents and webpages. This design isolates the reconstruction task from instruction-following abilities, enabling a direct assessment of a model's layout understanding, visual grounding, and knowledge integration. MMTR-Bench comprises 2,771 test samples spanning multiple languages and varying target lengths. To account for this diversity, we propose a level-aware evaluation protocol. Experiments on representative MLLMs show that the benchmark poses a significant challenge, especially for sentence- and paragraph-level reconstruction. The homepage is available at https://mmtr-bench-dataset.github.io/MMTR-Bench/.