This paper introduces CausalMem, a training-free approach for efficient streaming video understanding that maintains a dynamic memory bank under a fixed budget. CausalMem estimates the redundancy of visual tokens and updates the memory bank through an online semantic basis, so that critical information is retained while the memory footprint stays compact. Experiments report average accuracy gains of 3.2% and 3.0% on streaming and offline benchmarks, respectively. With a 12k-token budget for hour-long streaming videos, CausalMem compresses visual tokens by more than 20 times and occupies about 82 MB of storage.
CausalMem compresses visual tokens by more than 20x in streaming video understanding while improving average accuracy over existing methods on MLLMs.
Streaming video understanding remains a daunting task for existing \emph{multimodal large language models} (MLLMs). The difficulty lies not only in handling an ever-growing number of video frames, but also in the unpredictability of future video content and input instructions. In this paper, we study this task from the perspective of constructing a dynamic but fixed-budget memory bank, and propose a novel, training-free approach termed \emph{\textbf{CausalMem}}. CausalMem builds a dynamic visual memory update mechanism that maximizes the amount of streaming-video information retained within a limited memory space, much like the human brain. In practice, CausalMem estimates the redundancy of visual tokens and updates the memory bank via an online semantic basis, which models the principal semantics of the observed video stream. To validate CausalMem, we apply it to two representative MLLMs, namely LLaVA-OneVision and Qwen2.5-VL, and conduct extensive experiments on both streaming and offline video understanding benchmarks. The experimental results not only show clear advantages over existing methods under both streaming and offline settings, \emph{e.g.}, $+3.2\%$ and $+3.0\%$ average accuracy gains, respectively, but also demonstrate superior semantic preservation for streaming videos, \emph{e.g.}, using a 12$k$-token budget to memorize hour-long streaming videos, which achieves a visual token compression ratio of more than \textbf{20$\times$} and occupies only about \textbf{82 MB} of storage. \textbf{Our code} is available at \href{https://github.com/hktk07/CausalMem}{CausalMem}.
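The abstract describes the mechanism only at a high level, so the following is a minimal illustrative sketch of what a fixed-budget memory with redundancy-based eviction could look like, not the authors' implementation. Everything in it is an assumption made for illustration: the class name FixedBudgetMemory, its parameters, the use of cosine similarity to the nearest basis vector as the redundancy score, and the online k-means-style update standing in for the "online semantic basis". The linked repository remains the reference for the actual method.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _unit(v):
    # Scale v to unit length; a zero vector is returned unchanged.
    n = math.sqrt(_dot(v, v))
    return [x / n for x in v] if n > 0 else list(v)


class FixedBudgetMemory:
    """Hypothetical fixed-budget token memory; not the authors' implementation."""

    def __init__(self, budget, num_basis, lr=0.1, new_dir_threshold=0.9):
        self.budget = budget                        # max number of retained tokens
        self.num_basis = num_basis                  # size of the stand-in semantic basis
        self.lr = lr                                # step size for online basis updates
        self.new_dir_threshold = new_dir_threshold  # below this, a token counts as novel
        self.basis = []    # unit vectors summarizing the stream seen so far
        self.tokens = []   # retained token features
        self.scores = []   # redundancy of each retained token at arrival time

    def _nearest(self, unit_token):
        # Index and cosine similarity of the closest basis vector.
        sims = [_dot(unit_token, b) for b in self.basis]
        i = max(range(len(sims)), key=sims.__getitem__)
        return i, sims[i]

    def add(self, token):
        u = _unit(token)
        # Redundancy is scored against the basis *before* this token updates it.
        if self.basis:
            i, sim = self._nearest(u)
        else:
            i, sim = None, 0.0
        self.tokens.append(list(token))
        self.scores.append(max(0.0, sim))

        # Online basis update: open a new direction for novel tokens while
        # there is room, otherwise nudge the nearest direction toward the token.
        if len(self.basis) < self.num_basis and sim < self.new_dir_threshold:
            self.basis.append(u)
        elif i is not None:
            mixed = [(1 - self.lr) * b + self.lr * x
                     for b, x in zip(self.basis[i], u)]
            self.basis[i] = _unit(mixed)

        # Enforce the budget by evicting the most redundant retained token.
        if len(self.tokens) > self.budget:
            j = max(range(len(self.scores)), key=self.scores.__getitem__)
            del self.tokens[j]
            del self.scores[j]


# Toy stream: the duplicate of the first token is the one evicted.
mem = FixedBudgetMemory(budget=2, num_basis=4)
for tok in ([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]):
    mem.add(tok)
print(mem.tokens)  # [[1.0, 0.0], [0.0, 1.0]]
```

In this toy version the memory never grows past the budget, and a token is kept in proportion to how novel it was relative to the semantics observed before it arrived. A real system would operate on visual token features from the MLLM rather than hand-written 2-D vectors.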