Stanford HAI, NTU Taiwan, Texas A&M | May 26, 2026 | arXiv:2605.27101

Pop-Up Distractions Reveal Bag-of-Events Behavior in Video Large Language Models

Oscar Chew, Serhii Honcharenko, Qian-Hui Chen, Patricia Lu, Dishant Zaveri, Khoa D. Doan, Kuan-Hao Huang

AI Summary

DistractionBench is introduced to evaluate the temporal grounding and subject-event linking capabilities of VideoLLMs by inserting distracting, unrelated video segments. Experiments reveal that VideoLLMs exhibit a "bag-of-events" behavior, hallucinating interactions between entities from different segments and failing to maintain temporal consistency. Evaluation of 11 popular VideoLLMs demonstrates that all models exhibit this behavior, highlighting a critical weakness in their ability to understand temporally structured video.

Key Contribution

DistractionBench shows that VideoLLMs are unreliable at keeping track of who did what: they frequently attribute actions from one video segment to subjects in another.

Abstract

A key capability for video understanding is reliably linking subjects to events across time, yet whether Video Large Language Models (VideoLLMs) actually achieve this remains unclear. In this work, we introduce DistractionBench to evaluate whether VideoLLMs can robustly link subjects and events in the presence of unrelated video segments. Through controlled interventions, such as inserting short advertisement clips into longer videos, we show that VideoLLMs frequently hallucinate interactions between entities from different segments, incorrectly attributing actions from injected advertisements to subjects in the main video. We characterize this systematic hallucination as bag-of-events (BoE) behavior, where models process videos as collections of events rather than temporally structured sequences. Evaluating 11 popular VideoLLMs, we find that all models exhibit substantial BoE behavior. Our findings suggest that VideoLLMs lack reliable mechanisms for temporal grounding and motivate the development of models with more robust subject-event association.
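The intervention described in the abstract, splicing a short unrelated clip into a longer video and then checking whether the model attributes the clip's actions to subjects in the main video, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `insert_distractor` and `boe_rate`, the representation of a video as a list of frames, and the definition of the rate as a simple fraction of hallucinated answers are all assumptions made for illustration.

```python
from typing import List, Sequence, TypeVar

Frame = TypeVar("Frame")


def insert_distractor(
    main: Sequence[Frame], distractor: Sequence[Frame], position: int
) -> List[Frame]:
    """Splice a distractor clip into the main video at a frame index.

    Hypothetical helper: frames can be any objects (arrays, file paths, ids).
    """
    if not 0 <= position <= len(main):
        raise ValueError("position must lie within the main video")
    # Main frames before the cut, then the distractor, then the rest.
    return list(main[:position]) + list(distractor) + list(main[position:])


def boe_rate(hallucinated: Sequence[bool]) -> float:
    """Fraction of probe questions where the model affirmed an interaction
    between a main-video subject and a distractor-clip event.

    Assumed metric for illustration; the paper may define its score differently.
    """
    if not hallucinated:
        raise ValueError("need at least one probe result")
    return sum(hallucinated) / len(hallucinated)


# Toy example: string ids stand in for decoded frames.
main_video = ["m0", "m1", "m2", "m3"]
ad_clip = ["ad0", "ad1"]
mixed = insert_distractor(main_video, ad_clip, position=2)

# Each entry records whether the model wrongly linked a main-video
# subject to an event that only occurs in the inserted clip.
probe_results = [True, False, True, True]
rate = boe_rate(probe_results)
```

In this toy setup, `mixed` keeps the main video's frame order intact around the inserted clip, so any interaction the model reports between the two segments is a hallucination by construction.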

Computer Vision, Eval Frameworks & Benchmarks, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
