KAIST, NYU | Jun 2, 2026 | arXiv:2606.03920

Benchmarking Visual State Tracking in Multimodal Video Understanding

Sihyun Yu, Nanye Ma, Pinzhi Huang, Hyunseok Lee, Shusheng Yang, June Suk Choi, Ellis Brown, Oscar Michel, Boyang Zheng, Jinwoo Shin, Saining Xie

AI Summary

This paper introduces the Visual STAte Tracking benchmark (VSTAT), a novel evaluation tool for assessing visual state tracking capabilities in Multimodal Large Language Models (MLLMs) using 834 video clips and 1,500 complex questions. The findings reveal that state-of-the-art MLLMs significantly underperform compared to human capabilities, particularly in visually perceiving and tracking events over time, despite their strong performance on traditional benchmarks. The analysis indicates that while MLLMs can reason textually, they struggle with the visual aspects necessary for effective video understanding, highlighting a critical gap in current MLLM evaluations.

Key Contribution

MLLMs fail to visually track events in videos, scoring only modestly above answer-prior baselines on VSTAT despite strong results on existing video benchmarks.

Abstract

Understanding a video requires more than recognizing isolated moments, as humans continuously track entities, states, and events over time. This capacity for visual state tracking is fundamental to video understanding, yet remains underexplored in current evaluations of Multimodal Large Language Models (MLLMs). We introduce the Visual STAte Tracking benchmark (VSTAT), a video-based benchmark designed to diagnose visual state tracking in MLLMs. VSTAT consists of 834 clips drawn from both synthetic and real-world videos, paired with 1,500 questions that cannot be answered from any single frame or short segment and instead require continuous perception and integration of events across the entire video stream. Despite their strong performance on existing video benchmarks, we find that state-of-the-art MLLMs perform far below humans and only modestly above answer-prior baselines. To analyze this gap, we compare MLLMs' thinking traces with the underlying video stream to understand why and when MLLMs fail on VSTAT. We find that MLLMs reason and track correctly in text, but fail at visually perceiving the events they need to track. Finally, our preliminary evaluation suggests that recent agentic approaches, including MLLM-based video agents and coding agents, do not readily resolve these failures and still fall short on VSTAT.
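The abstract does not define "answer-prior baselines". One common reading, assumed here, is a blind baseline that ignores the video and always predicts the most frequent answer in the question set. The sketch below illustrates that reading on toy data; the function names, the answer strings, and the toy predictions are all hypothetical and are not taken from VSTAT.

```python
from collections import Counter


def answer_prior_accuracy(gold_answers):
    """Accuracy of always guessing the most frequent gold answer.

    Assumption: this is one plausible form of an "answer-prior" baseline;
    the paper may define it differently (e.g. per question type).
    """
    if not gold_answers:
        raise ValueError("gold_answers must be non-empty")
    _, top_count = Counter(gold_answers).most_common(1)[0]
    return top_count / len(gold_answers)


def accuracy(predictions, gold_answers):
    """Fraction of predictions that exactly match the gold answers."""
    if len(predictions) != len(gold_answers):
        raise ValueError("predictions and gold_answers must have equal length")
    if not gold_answers:
        raise ValueError("gold_answers must be non-empty")
    correct = sum(p == g for p, g in zip(predictions, gold_answers))
    return correct / len(gold_answers)


# Toy data (hypothetical): counting-style answers for six questions.
gold = ["2", "3", "2", "4", "2", "3"]
preds = ["2", "3", "3", "4", "2", "2"]

prior_acc = answer_prior_accuracy(gold)  # always guesses "2"
model_acc = accuracy(preds, gold)
margin = model_acc - prior_acc  # how far the model sits above the blind guess
```

Reporting the margin over such a baseline, rather than raw accuracy alone, shows how much of a score could be explained by answer statistics without looking at the video.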

Computer Vision, Eval Frameworks & Benchmarks, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
