Waterloo | Apr 6, 2026 | arXiv:2604.05117

Watch Before You Answer: Learning from Visually Grounded Post-Training

Eunjeong Hwang, Huaisong Zhang, Penghui Du, Yiming Jia, Dongfu Jiang, Xuan He, Shen Zhang, Ping Nie, Peter West, Kelsey Allen

AI Summary

The paper identifies a critical flaw in video understanding benchmarks and post-training datasets for VLMs: a significant portion of questions can be answered using text cues alone, which undermines true visual grounding. To address this, the authors introduce VidGround, a method for curating post-training data by filtering out linguistically biased questions. Post-training VLMs on VidGround-curated data with RL-based methods improves performance by up to 6.2 points over training on the full dataset while using only 69.1% of the original data, suggesting that data quality matters more than algorithmic complexity.

Key Contribution

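Current video understanding benchmarks and post-training datasets contain pervasive linguistic biases: 40-60% of questions in commonly reported long video understanding benchmarks can be answered from text cues alone, so VLMs may score well without actually "watching" the video.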

Abstract

It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in multimodal modeling, video understanding performance still lags behind text-based reasoning. In this work, we find that progress is even worse than previously assumed: commonly reported long video understanding benchmarks contain 40-60% of questions that can be answered using text cues alone. Furthermore, we find that these issues are also pervasive in widely used post-training datasets, potentially undercutting the ability of post-training to improve VLM video understanding performance. Guided by this observation, we introduce VidGround as a simple yet effective solution: using only the actual visually grounded questions without any linguistic biases for post-training. When used in tandem with RL-based post-training algorithms, this simple technique improves performance by up to 6.2 points relative to using the full dataset, while using only 69.1% of the original post-training data. Moreover, we show that data curation with a simple post-training algorithm outperforms several more complex post-training techniques, highlighting that data quality is a major bottleneck for improving video understanding in VLMs. These results underscore the importance of curating post-training data and evaluation benchmarks that truly require visual grounding to advance the development of more capable VLMs. Project page: http://vidground.etuagi.com.
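The filtering idea described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how text-answerable questions are detected, so this example assumes one plausible approach: query a model that sees only the question and options (no video), and drop any item it answers correctly on every trial. The names `VideoQA`, `TextOnlyAnswerer`, `is_text_answerable`, `filter_visually_grounded`, and `toy_answerer` are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class VideoQA:
    video_id: str
    question: str
    options: Sequence[str]
    answer: str  # gold option label, e.g. "B"


# Hypothetical interface (an assumption, not the paper's API): a model that
# sees only the question text and options, never the video, and returns a label.
TextOnlyAnswerer = Callable[[str, Sequence[str]], str]


def is_text_answerable(item: VideoQA, answerer: TextOnlyAnswerer, trials: int = 3) -> bool:
    """Flag an item as linguistically biased if the text-only model
    picks the gold answer on every one of `trials` attempts."""
    return all(
        answerer(item.question, item.options) == item.answer
        for _ in range(trials)
    )


def filter_visually_grounded(
    items: Sequence[VideoQA], answerer: TextOnlyAnswerer, trials: int = 3
) -> List[VideoQA]:
    """Keep only items the text-only model fails on at least once."""
    return [it for it in items if not is_text_answerable(it, answerer, trials)]


def toy_answerer(question: str, options: Sequence[str]) -> str:
    """Stand-in for a text-only model: pick the first option whose text
    literally appears in the question, otherwise guess "A"."""
    for label, text in zip("ABCD", options):
        if text.lower() in question.lower():
            return label
    return "A"


dataset = [
    # The answer leaks through the question text, so no video is needed.
    VideoQA("v1", "The narrator says the capital of France is Paris. "
                  "Which city is named?", ["Rome", "Paris", "Berlin"], "B"),
    # Nothing in the text reveals the answer; the video must be watched.
    VideoQA("v2", "What object does the person pick up last?",
            ["Mug", "Phone", "Book"], "C"),
]

kept = filter_visually_grounded(dataset, toy_answerer)
retained_fraction = len(kept) / len(dataset)
print([it.video_id for it in kept], retained_fraction)
```

In this toy run only the second item survives, because the first can be answered from the question alone. A real pipeline would replace `toy_answerer` with an actual language model and might use a softer threshold than "correct on every trial"; both choices are assumptions here.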

Computer Vision | Eval Frameworks & Benchmarks | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 57

Year: 2026

Venue: N/A
