Hubei Key Laboratory of Multimedia and Network; Institute of Artificial Intelligence; National Engineering Research Center for Multimedia; School of Computer Science; WHU

May 6, 2026

arXiv:2605.04870

VTAgent: Agentic Keyframe Anchoring for Evidence-Aware Video TextVQA

Haibin He, Maoyuan Ye, Juhua Liu, Bo Du

AI Summary

The paper identifies keyframe localization as the primary bottleneck in Video TextVQA, showing that frame-wise question answering significantly outperforms direct video-based inference. To address this, the authors introduce VTAgent, a question-guided agent framework that explicitly anchors relevant keyframes before answering. With supervised fine-tuning and reinforcement learning, VTAgent achieves state-of-the-art results on Video TextVQA benchmarks, with an average improvement of +12.12 in accuracy and +11.15 in ANLS.
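For readers unfamiliar with the ANLS metric cited above, the following is a minimal sketch of Average Normalized Levenshtein Similarity as it is commonly defined for text-based VQA. The 0.5 threshold and the lowercase/strip normalization are assumptions taken from common practice, not details confirmed by this page, and the function names are hypothetical.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls_score(prediction: str, references: Sequence[str],
               threshold: float = 0.5) -> float:
    """Best similarity of one prediction against its reference answers.

    Assumption: answers are compared after stripping and lowercasing, and
    any normalized distance >= threshold scores 0.
    """
    pred = prediction.strip().lower()
    best = 0.0
    for ref in references:
        gt = ref.strip().lower()
        longest = max(len(pred), len(gt))
        nl = levenshtein(pred, gt) / longest if longest else 0.0
        similarity = 1.0 - nl if nl < threshold else 0.0
        best = max(best, similarity)
    return best


def mean_anls(predictions: Sequence[str],
              references: Sequence[Sequence[str]],
              threshold: float = 0.5) -> float:
    """Average the per-question scores over a dataset."""
    scores = [anls_score(p, r, threshold) for p, r in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0
```

Unlike exact-match accuracy, this score gives partial credit to near-miss answers such as small OCR errors, which is why the two metrics are usually reported together.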

Key Contribution

Video-LLMs are leaving performance on the table: explicitly anchoring to keyframes before answering questions unlocks significant gains in Video TextVQA.

Abstract

Video text-based visual question answering (Video TextVQA) aims to answer questions by reasoning over visual textual content appearing in videos. Despite the strong multimodal video understanding capabilities of recent Video-LLMs, their performance on existing Video TextVQA benchmarks remains limited. To better understand this gap, we conduct an upper-bound analysis through frame-wise question answering, in which a sample counts as correct if any single frame yields the right answer. This frame-wise setting significantly outperforms direct video-based inference, revealing a substantial performance gap. The results suggest that the primary bottleneck lies in the localization of key question-relevant evidence, rather than in reasoning capacity itself. Building on this insight, we propose a question-guided agent framework that explicitly anchors the relevant keyframes before answering. The approach operates effectively in a training-free setting and consistently surpasses direct video inference. With additional supervised fine-tuning (SFT) and reinforcement learning (RL), it achieves an average improvement of +12.12 in accuracy and +11.15 in ANLS across benchmarks, establishing new state-of-the-art results. Our study underscores the critical role of explicit keyframe anchoring for advancing Video TextVQA. The code will be publicly released.
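The upper-bound analysis described in the abstract can be illustrated with a short sketch: each frame is queried independently, and a sample is counted as correct if at least one frame produces the right answer. This is an illustration under stated assumptions, not the authors' evaluation code. The `answer_frame` callable stands in for any frame-level model, the sample dictionary layout is hypothetical, and case-insensitive exact match is an assumed correctness rule.

```python
from typing import Callable, Dict, List, Sequence


def is_correct(prediction: str, references: Sequence[str]) -> bool:
    """Assumed matching rule: case-insensitive exact match after stripping."""
    pred = prediction.strip().lower()
    return any(pred == ref.strip().lower() for ref in references)


def oracle_frame_accuracy(samples: List[Dict],
                          answer_frame: Callable[[object, str], str]) -> float:
    """Fraction of samples where ANY single frame yields a correct answer.

    Each sample is a dict with hypothetical keys:
      "frames":   list of frames (any type the model accepts)
      "question": the question string
      "answers":  list of acceptable reference answers
    """
    hits = 0
    for sample in samples:
        if any(
            is_correct(answer_frame(frame, sample["question"]), sample["answers"])
            for frame in sample["frames"]
        ):
            hits += 1
    return hits / len(samples) if samples else 0.0


# Toy usage: frames are represented by the text visible in them, and the
# stand-in "model" simply reads that text back.
def read_visible_text(frame: str, question: str) -> str:
    return frame


toy_samples = [
    {"frames": ["", "EXIT", ""], "question": "What does the sign say?",
     "answers": ["exit"]},
    {"frames": ["a", "b"], "question": "What does the sign say?",
     "answers": ["stop"]},
]
oracle_accuracy = oracle_frame_accuracy(toy_samples, read_visible_text)
```

Because the score takes the best frame per sample, it measures what a model could achieve with perfect keyframe selection, which is why a large gap over direct video inference points to localization rather than reasoning as the bottleneck.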

Eval Frameworks & Benchmarks, Multimodal Models, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
