The paper introduces Em-Garde, a framework for proactive streaming video understanding that addresses the efficiency-accuracy trade-off in existing VideoLLMs by decoupling semantic understanding from streaming perception. Em-Garde uses an Instruction-Guided Proposal Parser to transform user queries into structured visual proposals, and a Lightweight Proposal Matching Module for efficient embedding-based matching to trigger responses. Experiments on StreamingBench and OVO-Bench show that Em-Garde achieves improved proactive response accuracy and efficiency compared to previous models.
Proactive VideoLLMs can be both accurate and efficient thanks to a propose-match framework that decouples semantic understanding from streaming perception.
Recent advances in Streaming Video Understanding have enabled a new interaction paradigm in which models respond proactively to user queries. Current proactive VideoLLMs make a triggering decision at every frame, which creates an efficiency-accuracy dilemma. We propose Em-Garde, a novel framework that decouples semantic understanding from streaming perception. At query time, the Instruction-Guided Proposal Parser transforms user queries into structured, perceptually grounded visual proposals; during streaming, a Lightweight Proposal Matching Module performs efficient embedding-based matching to trigger responses. Experiments on StreamingBench and OVO-Bench demonstrate consistent improvements over prior models in proactive response accuracy and efficiency, validating an effective solution for proactive video understanding under strict computational constraints.
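The propose-match idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract only states that matching is embedding-based, so the use of cosine similarity, the fixed threshold of 0.8, the function names `cosine` and `match_stream`, and the toy three-dimensional embeddings are all assumptions made for illustration. In a real system the proposal embeddings would come from encoding the parsed visual proposals once at query time, and the frame embeddings from a lightweight visual encoder running on the stream.

```python
import math
from typing import Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def match_stream(
    proposal_embeddings: Sequence[Sequence[float]],
    frame_embeddings: Sequence[Sequence[float]],
    threshold: float = 0.8,  # hypothetical value, not taken from the paper
) -> Optional[int]:
    """Return the index of the first frame that matches any proposal.

    The proposals are computed once (query time); the loop below only
    performs cheap similarity checks per frame (streaming time).
    """
    for t, frame in enumerate(frame_embeddings):
        best = max(
            (cosine(frame, p) for p in proposal_embeddings),
            default=0.0,  # no proposals means nothing can match
        )
        if best >= threshold:
            return t  # trigger a response at this frame
    return None  # no frame matched; stay silent


# Toy data: two proposals and a three-frame stream (all values made up).
proposals = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
frames = [[0.0, 0.0, 1.0], [0.5, 0.5, 0.7], [0.9, 0.1, 0.0]]
trigger_index = match_stream(proposals, frames)
print(trigger_index)
```

In this sketch the expensive step (turning the query into proposals) happens once, while the per-frame cost is a handful of dot products, which is the kind of decoupling the abstract describes.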