This paper introduces a question-aware keyframe selection framework for VideoQA that uses pseudo keyframe labels generated by LMMs for supervision and a coverage regularization term to encourage diversity. The LMM-derived labels address the sparse supervision available to existing keyframe selection approaches, while the coverage term counters their tendency to select redundant frames. Experiments on NExT-QA show significant accuracy improvements, particularly for temporal and causal question types, supporting learnable keyframe selection as an effective module for VideoQA.
LMMs can bootstrap themselves: pseudo-labels from LMMs provide surprisingly effective supervision for question-aware keyframe selection in VideoQA, leading to significant accuracy gains.
Large multimodal models (LMMs) have recently demonstrated remarkable performance in video question answering (VideoQA), yet reasoning over video remains challenging due to high inference cost and diluted information. Keyframe selection offers efficiency and sharper reasoning but suffers from sparse supervision and redundant frame choices when relying only on image-text similarity. We present a question-aware keyframe selection framework with two components: pseudo keyframe labels derived from LMMs that provide informative supervision and a coverage regularization that promotes diverse, complementary evidence across time. Experiments on NExT-QA show that our method significantly improves accuracy, especially for temporal and causal question types, establishing keyframe selection as an effective and learnable module for VideoQA.
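To make the two components concrete, the following is a minimal sketch of how such a training objective could be assembled. It is not the paper's formulation: the abstract does not give the exact loss, so the binary cross-entropy against pseudo labels, the segment-based coverage penalty, and the names `pseudo_label_loss`, `coverage_penalty`, `selection_loss`, `num_segments`, and `lam` are all assumptions made for illustration.

```python
import numpy as np


def pseudo_label_loss(probs, pseudo_labels, eps=1e-7):
    """Binary cross-entropy between per-frame selection probabilities and
    0/1 pseudo keyframe labels (assumed here to come from an LMM)."""
    p = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    y = np.asarray(pseudo_labels, dtype=float)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))


def coverage_penalty(probs, num_segments=4):
    """Hypothetical coverage term: split the timeline into equal segments and
    penalize segments in which no frame receives a high selection probability.
    Assumes at least one frame."""
    p = np.asarray(probs, dtype=float)
    k = max(1, min(num_segments, p.size))
    # Highest selection probability inside each temporal segment.
    seg_max = np.array([seg.max() for seg in np.array_split(p, k)])
    # 0 when every segment has a confidently selected frame, up to 1 otherwise.
    return float(1.0 - seg_max.mean())


def selection_loss(probs, pseudo_labels, lam=0.5, num_segments=4):
    """Supervised term plus weighted coverage term (the weight lam is an assumption)."""
    return pseudo_label_loss(probs, pseudo_labels) + lam * coverage_penalty(
        probs, num_segments
    )


# Two selectors over 8 frames: one concentrates on the first half of the
# video, the other spreads its picks across the whole timeline.
clustered = [0.9, 0.9, 0.9, 0.9, 0.1, 0.1, 0.1, 0.1]
spread = [0.9, 0.1, 0.9, 0.1, 0.9, 0.1, 0.9, 0.1]
```

Under this sketch, `coverage_penalty(spread)` is lower than `coverage_penalty(clustered)`, which captures the intent of favoring complementary evidence across time over redundant neighboring frames. A trainable version would replace the hard per-segment maximum with a differentiable surrogate in an autodiff framework.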