Search papers, labs, and topics across Lattice.
The paper identifies keyframe localization as the primary bottleneck in Video TextVQA, showing that frame-wise question answering significantly outperforms direct video-based inference. To address this, they introduce VTAgent, a question-guided agent framework that explicitly anchors relevant keyframes before answering. VTAgent achieves state-of-the-art results on Video TextVQA benchmarks, with an average improvement of +12.12 in accuracy and +11.15 in ANLS after supervised fine-tuning and reinforcement learning.
Video-LLMs are leaving performance on the table: explicitly anchoring to keyframes before answering questions unlocks significant gains in Video TextVQA.
Video text-based visual question answering (Video TextVQA) aims to answer questions by reasoning over visual textual content appearing in videos. Despite the strong multimodal video understanding capabilities of recent Video-LLMs, their performance on existing Video TextVQA benchmarks remains limited. To better understand this gap, we conduct an upper-bound analysis through frame-wise question answering, counting a sample as correct if any frame yields the right answer, which significantly outperforms direct video-based inference and reveals a substantial performance gap. The results suggest that the primary bottleneck lies in the localization of key question-relevant evidence, rather than in reasoning capacity itself. Building on this insight, we propose a question-guided agent framework that explicitly anchors the relevant keyframes before answering. The approach operates effectively in a training-free setting and consistently surpasses direct video inference. With additional supervised fine-tuning (SFT) and reinforcement learning (RL), it achieves an average improvement of +12.12 in accuracy and +11.15 in ANLS across benchmarks, establishing new state-of-the-art results. Our study underscores the critical role of explicit keyframe anchoring for advancing Video TextVQA. The code will be publicly released.