Search papers, labs, and topics across Lattice.
The paper identifies a critical flaw in video understanding benchmarks and post-training datasets for VLMs: a significant portion of questions can be answered using text cues alone, undermining true visual grounding. To address this, they introduce VidGround, a method for curating post-training data by filtering out linguistically biased questions. Post-training VLMs with VidGround, especially when combined with RL-based methods, yields performance gains of up to 6.2 points while using less data, demonstrating the importance of data quality over complex algorithms.
Current video understanding benchmarks and post-training datasets are riddled with linguistic biases, meaning VLMs might be acing tests without actually "watching" the video.
It is critical for vision-language models (VLMs) to comprehensively understand visual, temporal, and textual cues. However, despite rapid progress in multimodal modeling, video understanding performance still lags behind text-based reasoning. In this work, we find that progress is even worse than previously assumed: commonly reported long video understanding benchmarks contain 40-60% of questions that can be answered using text cues alone. Furthermore, we find that these issues are also pervasive in widely used post-training datasets, potentially undercutting the ability of post-training to improve VLM video understanding performance. Guided by this observation, we introduce VidGround as a simple yet effective solution: using only the actual visually grounded questions without any linguistic biases for post-training. When used in tandem with RL-based post-training algorithms, this simple technique improves performance by up to 6.2 points relative to using the full dataset, while using only 69.1% of the original post-training data. Moreover, we show that data curation with a simple post-training algorithm outperforms several more complex post-training techniques, highlighting that data quality is a major bottleneck for improving video understanding in VLMs. These results underscore the importance of curating post-training data and evaluation benchmarks that truly require visual grounding to advance the development of more capable VLMs. Project page: http://vidground.etuagi.com.