This paper investigates the limitations of existing Visual Text Comprehension (VTC) pipelines, which treat rendering as a fixed preprocessing step, by analyzing how vision-language models (VLMs) process visualized text. The authors identify a "localization-without-utilization" phenomenon, in which attention localizes the relevant evidence but is largely decoupled from answer correctness, and show that enlarging the localized spans on the rendered page recovers a large fraction of the failures. They introduce Attention-Guided Adaptive Rendering (AGAR), a training-free, model-agnostic method that uses the model's own attention to select important word spans and re-renders the page with those spans enlarged, yielding consistent gains across nine VTC benchmarks and four VLM backbones.
VLMs can attend to the right evidence yet still answer incorrectly, but simply enlarging the attended spans on the rendered page recovers a large fraction of those failures.
Visual Text Comprehension (VTC) renders text into images for a vision-language model (VLM) to read, sidestepping LLM context-window limits and powering applications from long-page OCR to multi-page memory QA. Yet existing VTC pipelines treat rendering and layout as a fixed, content-agnostic preprocessing step and offer little mechanistic understanding of how VLMs internally process visualized text. Through a focused empirical study on VTC QA tasks, we reveal that VLMs exhibit a localization-without-utilization regime: evidence-localizing attention emerges sharply in the middle-to-late layers and is largely decoupled from answer correctness, yet simply enlarging the localized spans on the rendered page recovers a large fraction of the failures. Building on these observations, we propose AGAR (Attention-Guided Adaptive Rendering), a training-free, model-agnostic method that leverages a VLM's own middle-to-late layer attention to identify the top-K important visual patches, maps them back to word spans, and re-renders the page with those spans enlarged before re-inferring the answer. Extensive experiments across nine VTC benchmarks (short-form, long-context, and multi-page memory QA) and four VLM backbones show that AGAR (i) consistently improves off-the-shelf VLMs as a plug-and-play enhancement, (ii) composes with VLM post-training to yield further gains, and (iii) remains robust under both visual- and text-side input degradation.
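The selection step described in the abstract (top-K patches, mapped back to word spans, then enlarged) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the function names, the square patch grid, the pixel bounding boxes per word, and the `scale` factor are all assumptions introduced here, and extracting attention from a VLM, rendering the page, and re-running inference are left out.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in pixels


def top_k_patches(attn: Sequence[Sequence[float]], k: int) -> List[Tuple[int, int]]:
    """Return (row, col) of the k highest-scoring patches in a 2D attention map."""
    scored = [(s, r, c) for r, row in enumerate(attn) for c, s in enumerate(row)]
    # Highest score first; ties broken by position so the result is deterministic.
    scored.sort(key=lambda t: (-t[0], t[1], t[2]))
    return [(r, c) for _, r, c in scored[:k]]


def patch_box(row: int, col: int, patch_size: int) -> Box:
    """Pixel box covered by a patch, assuming a regular square patch grid."""
    return (col * patch_size, row * patch_size,
            (col + 1) * patch_size, (row + 1) * patch_size)


def overlaps(a: Box, b: Box) -> bool:
    """True if two axis-aligned boxes share positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def patches_to_word_spans(
    patches: Sequence[Tuple[int, int]],
    word_boxes: Sequence[Box],
    patch_size: int,
) -> List[Tuple[int, int]]:
    """Map selected patches to inclusive (start, end) spans of word indices."""
    hit = [i for i, wb in enumerate(word_boxes)
           if any(overlaps(wb, patch_box(r, c, patch_size)) for r, c in patches)]
    spans: List[Tuple[int, int]] = []
    for i in hit:  # hit is already in ascending order
        if spans and i == spans[-1][1] + 1:
            spans[-1] = (spans[-1][0], i)  # extend the current contiguous span
        else:
            spans.append((i, i))
    return spans


def font_scales(n_words: int, spans: Sequence[Tuple[int, int]],
                scale: float = 1.5) -> List[float]:
    """Per-word font multiplier: `scale` inside selected spans, 1.0 elsewhere."""
    scales = [1.0] * n_words
    for start, end in spans:
        for i in range(start, end + 1):
            scales[i] = scale
    return scales
```

In such a sketch, a renderer would consume the output of `font_scales` to draw the enlarged words, and the re-rendered page would then be passed back to the VLM for a second inference pass. Which layers to read attention from, the value of K, and the enlargement factor are choices the abstract does not specify.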