Fudan, HUST, NJU, Ohio State, Shanghai Innovation

May 25, 2026

arXiv:2605.25621

StreamOV: Streaming Omni-Video Understanding via Evidence-Guided Memory and Response Triggering

Ming Xie, Zizheng Huang, Xudong Tan, Chao Wang, Xiangyu Zeng, Wenxiao Wu, Tao Chen, Limin Wang

AI Summary

The paper introduces StreamOV, a framework for streaming omni-video understanding that addresses two limitations of existing offline methods: they cannot manage continuously growing, long-horizon audio-visual context, and they cannot decide on their own when to respond. StreamOV uses a multimodal evidence-guided long-short term memory to condense historical context under a fixed budget, and a hidden-state-driven trigger to determine when to respond. The authors also introduce SOVBench, a new benchmark for online, multi-turn omni-modal evaluation, and report state-of-the-art performance for StreamOV on streaming and omni-video tasks.

Key Contribution

StreamOV targets real-time, proactive video understanding by pairing a bounded memory with a hidden-state-driven response trigger, addressing the limitations of offline methods.

Abstract

While streaming omni-video understanding demands continuous perception and proactive, real-time interaction, this crucial area remains largely under-explored. Current omni-modal methods are inherently designed for offline settings, limiting their applicability in streaming scenarios due to two fundamental flaws. First, they lack robust mechanisms to manage continuously growing audio-visual context over long horizons and cannot autonomously initiate responses at opportune moments. Second, existing benchmarks are predominantly confined to offline, single-turn question answering, failing to capture continuous, multi-turn streaming interactions. To bridge these gaps, we propose StreamOV, a novel Streaming Omni-Video understanding framework for efficient online audio-visual reasoning with bounded memory and proactive response triggering. Specifically, StreamOV introduces a multimodal evidence-guided long-short term memory that condenses historical audio-visual context into compact informative evidence under a fixed budget. It further employs a hidden-state-driven trigger to decide when to respond, avoiding explicit silence-token generation and external routers. We also curate SOVBench, the first comprehensive benchmark for online, multi-turn omni-modal evaluation. Extensive experiments show that StreamOV achieves state-of-the-art performance across diverse streaming and omni-video benchmarks, demonstrating its effectiveness for both online and offline video understanding.
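The abstract does not give the details of either mechanism, so the following is only a minimal illustrative sketch of the two ideas it names: a memory that stays within a fixed budget, and a trigger that reads a hidden state to decide whether to respond. Everything here is an assumption made for illustration, not the authors' method: the class `BoundedMemory`, the function `should_respond`, the per-chunk evidence `score`, the eviction rule (drop the lowest-scoring long-term chunk), and the linear probe with a sigmoid threshold are all hypothetical.

```python
import math
from dataclasses import dataclass, field


@dataclass
class BoundedMemory:
    """Hypothetical long-short term memory with a fixed total budget.

    Recent chunks live in `short`; older chunks move to `long`, where
    only the highest-scoring ones survive. The scoring itself is assumed
    to come from elsewhere (it is not specified in the abstract).
    """
    short_size: int
    long_size: int
    short: list = field(default_factory=list)
    long: list = field(default_factory=list)

    def add(self, chunk_id, score):
        """Insert a new audio-visual chunk with its evidence score."""
        self.short.append((chunk_id, score))
        if len(self.short) > self.short_size:
            # The oldest short-term chunk graduates to long-term memory.
            self.long.append(self.short.pop(0))
            if len(self.long) > self.long_size:
                # Assumed eviction rule: drop the weakest evidence.
                weakest = min(range(len(self.long)),
                              key=lambda i: self.long[i][1])
                self.long.pop(weakest)

    def contents(self):
        """Chunk ids currently retained, in chronological order."""
        return [c for c, _ in self.long] + [c for c, _ in self.short]


def should_respond(hidden_state, probe_weights, bias=0.0, threshold=0.5):
    """Assumed trigger: a linear probe on the hidden state plus a sigmoid.

    Returns True when the probe's probability reaches the threshold, so
    no explicit silence token has to be generated to stay quiet.
    """
    logit = sum(h * w for h, w in zip(hidden_state, probe_weights)) + bias
    prob = 1.0 / (1.0 + math.exp(-logit))
    return prob >= threshold


if __name__ == "__main__":
    memory = BoundedMemory(short_size=2, long_size=2)
    for chunk_id, score in enumerate([0.1, 0.9, 0.2, 0.8, 0.5, 0.3]):
        memory.add(chunk_id, score)
    print(memory.contents())
    print(should_respond([1.0, 0.0], [2.0, 0.0]))
```

In this sketch the memory never holds more than `short_size + long_size` chunks regardless of stream length, which is the property the abstract refers to as a fixed budget. How evidence is actually scored and how the trigger is trained are left open here.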

Computer Vision, Multimodal Models, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
