This paper introduces Moment-Video, a benchmark designed to evaluate the temporal fidelity of video multimodal large language models (MLLMs) in understanding momentary visual events. By focusing on localized, transient visual evidence critical for answering questions, the study reveals that even the best-performing model, Seed-2.0-Pro, achieves only 39.6% accuracy, while most open-source models fall below 25%. The findings highlight significant limitations in current MLLMs' ability to capture and utilize brief visual cues, underscoring the need for improved temporal representation in video understanding tasks.
Current video MLLMs struggle to capture fleeting visual events: the best-performing model reaches only 39.6% accuracy on Moment-Video, and most open-source models stay below 25%.
Video multimodal large language models (MLLMs) have made rapid progress on general and long-form video understanding, yet their ability to preserve brief answer-critical visual evidence remains underexplored. Many practical questions are determined by momentary visual events: localized actions or state transitions that may last only a few frames. Such evidence can be skipped by sparse frame sampling, suppressed by visual-token compression, or diluted by coarse temporal aggregation, causing failures that language-side reasoning cannot reliably recover. We introduce Moment-Video, a benchmark for diagnosing the temporal fidelity of video MLLMs through momentary visual event understanding. Each question is grounded in a localized, visually observable, and sampling-sensitive event, requiring models to notice, count, describe, or reason about transient evidence rather than rely on persistent objects, global scene context, or language priors. Moment-Video contains 1,000 human-verified video-QA pairs across 7 domains and 25 fine-grained subcategories, covering four task types: Temporal Occurrence, Temporal Counting, Action Description, and Temporal Reasoning. We evaluate 33 proprietary and open-source MLLMs on Moment-Video. The best-performing model, Seed-2.0-Pro, achieves only 39.6% overall accuracy, while most open-source models remain below 25%, revealing a substantial gap in momentary visual event understanding. Diagnostic analyses show that denser frame sampling improves some models but does not eliminate the bottleneck, and longer videos introduce stronger temporal-localization challenges. These findings suggest that current video MLLMs still lack temporally faithful representations for capturing, preserving, and using brief but decisive visual evidence.
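The claim that sparse frame sampling can skip a momentary event can be illustrated with a small sketch. This is not the benchmark's or any model's actual sampling code. The clip length, frame rate, event position, sample counts, and the helper names `uniform_sample_indices` and `event_is_sampled` are all hypothetical, chosen only to show the effect.

```python
from typing import List


def uniform_sample_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick frame indices spread evenly over the video.

    The video is split into num_samples equal segments and the frame at the
    centre of each segment is taken (one common uniform-sampling scheme).
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    num_samples = min(num_samples, total_frames)
    segment = total_frames / num_samples
    return [int(segment * (i + 0.5)) for i in range(num_samples)]


def event_is_sampled(indices: List[int], event_start: int, event_end: int) -> bool:
    """Return True if any sampled frame lies inside the event's frame span (inclusive)."""
    return any(event_start <= idx <= event_end for idx in indices)


# Hypothetical setup: a 60 s clip at 30 fps (1800 frames) containing a
# 12-frame (0.4 s) event at frames 1000-1011.
TOTAL_FRAMES = 1800
EVENT = (1000, 1011)

sparse = uniform_sample_indices(TOTAL_FRAMES, 16)   # one frame every ~112 frames
dense = uniform_sample_indices(TOTAL_FRAMES, 256)   # one frame every ~7 frames

print("16 frames hit the event: ", event_is_sampled(sparse, *EVENT))
print("256 frames hit the event:", event_is_sampled(dense, *EVENT))
```

In this toy setup the 16-frame sample steps over the event entirely, while the 256-frame sample includes at least one event frame. Including the frame only makes the evidence available to the model. Consistent with the abstract's observation that denser sampling does not eliminate the bottleneck, the evidence can still be lost later to token compression or temporal aggregation.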