Mar 16, 2026 · arXiv:2603.14953

Learning Question-Aware Keyframe Selection with Synthetic Supervision for Video Question Answering

AI Summary

This paper introduces a question-aware keyframe selection framework for VideoQA that uses pseudo keyframe labels generated by LMMs for supervision and a coverage regularization term to encourage diversity. By leveraging LMMs to create informative supervision signals, the method addresses the limitations of sparse supervision and redundant frame selection common in existing keyframe selection approaches. Experiments on NExT-QA demonstrate significant accuracy improvements, particularly for temporal and causal question types, validating the effectiveness of learnable keyframe selection for VideoQA.

Key Contribution

LMMs can bootstrap themselves: pseudo-labels from LMMs provide surprisingly effective supervision for question-aware keyframe selection in VideoQA, leading to significant accuracy gains.

Abstract

Large multimodal models (LMMs) have recently demonstrated remarkable performance in video question answering (VideoQA), yet reasoning over video remains challenging due to high inference cost and diluted information. Keyframe selection offers efficiency and sharper reasoning but suffers from sparse supervision and redundant frame choices when relying only on image-text similarity. We present a question-aware keyframe selection framework with two components: pseudo keyframe labels derived from LMMs that provide informative supervision and a coverage regularization that promotes diverse, complementary evidence across time. Experiments on NExT-QA show that our method significantly improves accuracy, especially for temporal and causal question types, establishing keyframe selection as an effective and learnable module for VideoQA.
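The two components named in the abstract can be illustrated with a small sketch. The abstract does not give the actual loss, so everything here is an assumption: the function names, the binary cross-entropy form of the pseudo-label term, the segment-based coverage penalty, and the weight `lam` are hypothetical stand-ins, not the authors' formulation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def pseudo_label_loss(scores, pseudo_labels):
    """Binary cross-entropy between per-frame logits and 0/1 pseudo keyframe labels.

    Assumption: the LMM-derived pseudo labels are binary, one per frame.
    """
    p = np.clip(sigmoid(scores), 1e-7, 1 - 1e-7)
    return float(-np.mean(pseudo_labels * np.log(p)
                          + (1 - pseudo_labels) * np.log(1 - p)))

def coverage_penalty(scores, num_segments=4):
    """Hypothetical coverage term.

    Splits the timeline into equal segments and penalizes selection mass
    that deviates from a uniform spread across segments, so that selected
    frames are encouraged to cover different parts of the video.
    """
    p = sigmoid(scores)
    mass = np.array([seg.sum() for seg in np.array_split(p, num_segments)])
    mass = mass / mass.sum()
    return float(np.sum((mass - 1.0 / num_segments) ** 2))

def selection_loss(scores, pseudo_labels, num_segments=4, lam=0.1):
    """Pseudo-label supervision plus a weighted coverage regularizer (lam is assumed)."""
    return (pseudo_label_loss(scores, pseudo_labels)
            + lam * coverage_penalty(scores, num_segments))

def select_keyframes(scores, k):
    """Return the indices of the k highest-scoring frames in temporal order."""
    top = np.argsort(scores)[-k:]
    return sorted(top.tolist())

if __name__ == "__main__":
    # Toy example: 8 frames, pseudo labels mark frames 1 and 6 as keyframes.
    scores = np.array([-2.0, 3.0, -1.0, -2.5, -3.0, 0.5, 2.0, -1.5])
    labels = np.array([0, 1, 0, 0, 0, 0, 1, 0], dtype=float)
    print("loss:", selection_loss(scores, labels))
    print("keyframes:", select_keyframes(scores, k=2))
```

In a trained system the scores would come from a question-conditioned scorer over frame features and the loss would be minimized by gradient descent; this sketch only shows how a supervision term and a diversity term might be combined.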

Computer Vision · Data Curation & Synthetic Data · Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
