CUHK, Huawei | Apr 5, 2026 | arXiv:2604.04184

AURA: Always-On Understanding and Real-Time Assistance via Video Streams

Xudong Lu, Yang Bo, Jinpeng Chen, Shuhan Li, Xintong Guo, Huankang Guan, Fang Liu, Dunyuan Xu, Peiwen Sun, Heyang Sun, Rui Liu

AI Summary

The paper introduces AURA, an end-to-end streaming visual interaction framework that allows VideoLLMs to continuously process video streams for real-time question answering and proactive responses. AURA unifies context management, data construction, training objectives, and deployment optimization to achieve stable long-horizon streaming interaction. The framework achieves state-of-the-art performance on streaming benchmarks and runs at 2 FPS on two 80G accelerators.

Key Contribution

AURA enables a single VideoLLM to answer open-ended questions and respond proactively over live video streams in real time, moving beyond captioning-style narration.

Abstract

Video Large Language Models (VideoLLMs) have achieved strong performance on many video understanding tasks, but most existing systems remain offline and are not well-suited for live video streams that require continuous observation and timely response. Recent streaming VideoLLMs have made progress, yet current approaches often rely on decoupled trigger-response pipelines or are limited to captioning-style narration, reducing their effectiveness for open-ended question answering and long-horizon interaction. We propose AURA (Always-On Understanding and Real-Time Assistance), an end-to-end streaming visual interaction framework that enables a unified VideoLLM to continuously process video streams and support both real-time question answering and proactive responses. AURA integrates context management, data construction, training objectives, and deployment optimization for stable long-horizon streaming interaction. It achieves state-of-the-art performance on streaming benchmarks and supports a real-time demo system with ASR and TTS running at 2 FPS on two 80G accelerators. We release the AURA model together with a real-time inference framework to facilitate future research.

Computer Vision, Multimodal Models, Tool Use & Agents


Year: 2026

Venue: N/A
