LongCat Team | Apr 6, 2026 | arXiv:2604.05015

Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

Chaoyou Fu, Hao Yuan, Haozhi Yuan, Yuhao Dong, Yifan Zhang, Yifan Zhang, Yunhang Shen, Xiaoxing Hu, Xueying Li, Jinsen Su, Chengwu Long, Xiaoyao Xie, Yongkang Xie, Yong Xie, Xiawu Zheng, Xuejiao Yang, Xue Yang, Haoyu Cao, Yunsheng Wu, Ziwei Liu, Ziwei Liu, Xing Sun, Caifeng Shan, Ran He

AI Summary

Video-MME-v2 is introduced as a new benchmark to address the saturation of existing video understanding benchmarks by rigorously evaluating robustness and faithfulness. It features a progressive tri-level hierarchy of video comprehension complexity, ranging from visual aggregation to temporal dynamics and multimodal reasoning. The benchmark also employs a group-based non-linear evaluation strategy to enforce consistency and coherence, revealing a significant performance gap between current models like Gemini-3-Pro and human experts, particularly in visual information aggregation and temporal modeling.

Key Contribution

Leaderboard-topping video models are still surprisingly brittle, failing on basic video reasoning tasks unless given the right textual cues.

Abstract

With the rapid advancement of video understanding, existing benchmarks are becoming increasingly saturated, exposing a critical discrepancy between inflated leaderboard scores and real-world model capabilities. To address this widening gap, we introduce Video-MME-v2, a comprehensive benchmark designed to rigorously evaluate the robustness and faithfulness of video understanding. To systematically evaluate model capabilities, we design a progressive tri-level hierarchy that incrementally increases the complexity of video comprehension, ranging from multi-point visual information aggregation, to temporal dynamics modeling, and ultimately to complex multimodal reasoning. In addition, in contrast to conventional per-question accuracy, we propose a group-based non-linear evaluation strategy that enforces both consistency across related queries and coherence in multi-step reasoning. It penalizes fragmented or guess-based correctness and assigns credit only to answers supported by valid reasoning. To guarantee data quality, Video-MME-v2 is constructed through a rigorously controlled human annotation pipeline involving 12 annotators and 50 independent reviewers. Backed by 3,300 human-hours and up to 5 rounds of quality assurance, Video-MME-v2 aims to serve as one of the most authoritative video benchmarks. Extensive experiments reveal a substantial gap between the current best model, Gemini-3-Pro, and human experts, and uncover a clear hierarchical bottleneck where errors in visual information aggregation and temporal modeling propagate to limit high-level reasoning. We further find that thinking-based reasoning is highly dependent on textual cues, improving performance with subtitles but sometimes degrading it in purely visual settings. By exposing these limitations, Video-MME-v2 establishes a demanding new testbed for the development of next-generation video MLLMs.
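The abstract does not give the exact scoring formula, so the following is only a minimal sketch of what a group-based non-linear evaluation could look like. Everything in it is an assumption made for illustration: the function names, the power-law penalty for consistency groups, and the "credit up to the first error" rule for reasoning chains are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def consistency_score(correct: Sequence[bool], power: float = 2.0) -> float:
    """Hypothetical non-linear credit for a group of related questions.

    Assumption: the group score is the fraction of correct answers raised
    to `power`, so partially correct groups earn less than their linear share.
    """
    if not correct:
        return 0.0
    fraction = sum(correct) / len(correct)
    return fraction ** power


def coherence_score(correct: Sequence[bool]) -> float:
    """Hypothetical credit for an ordered multi-step reasoning chain.

    Assumption: only the unbroken run of correct steps from the start
    counts; answers after the first error earn nothing.
    """
    if not correct:
        return 0.0
    prefix = 0
    for ok in correct:
        if not ok:
            break
        prefix += 1
    return prefix / len(correct)


def per_question_accuracy(groups: Sequence[Sequence[bool]]) -> float:
    """Conventional baseline: plain accuracy over all questions, ignoring groups."""
    flat = [ok for group in groups for ok in group]
    return sum(flat) / len(flat) if flat else 0.0
```

Under these assumed rules, a group with two of four answers correct earns 0.25 rather than 0.5, and a four-step chain that is wrong at step two earns 0.25 even if later steps happen to be right. This illustrates how a group-based score can fall well below per-question accuracy when correctness is fragmented.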

Computer Vision | Eval Frameworks & Benchmarks | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 37

Year: 2026

Venue: N/A
