Mar 31, 2026 · arXiv:2603.29252

Scaling the Long Video Understanding of Multimodal Large Language Models via Visual Memory Mechanism

Chao Chang, Xiaoshuai Sun, Yiyi Zhou, Rongrong Ji

AI Summary

This paper introduces Flexible Memory (FlexMem), a training-free approach to long video understanding in Multimodal Large Language Models (MLLMs). It mimics how humans watch video: content is processed continually, and the most relevant memory fragments are recalled to answer a question. FlexMem uses visual KV caches as the memory source, transfers and writes memory through a dual-pathway compression design, and applies different memory reading strategies to different video understanding tasks. Experiments on five long video tasks and one streaming video task, with FlexMem applied to two video-MLLMs, show improvements over existing efficient video understanding methods. On a single 3090 GPU, FlexMem processes more than 1k frames and helps the base MLLMs reach comparable or better performance than state-of-the-art MLLMs such as GPT-4o and Gemini-1.5 Pro on some benchmarks.

Key Contribution

FlexMem requires no training: by recalling relevant visual memory, it lets base MLLMs process more than 1k frames on a single 3090 GPU and match or exceed SOTA MLLMs on some long-video benchmarks.

Abstract

Long video understanding is a key challenge that hinders the advancement of Multimodal Large Language Models (MLLMs). In this paper, we study this problem from the perspective of the visual memory mechanism and propose a novel, training-free approach termed Flexible Memory (FlexMem). In principle, FlexMem aims to mimic the human behavior of video watching, i.e., continually watching video content and recalling the most relevant memory fragments to answer the question. In this way, FlexMem can help MLLMs understand videos of infinite length, unlike previous methods that process all video information at once and therefore have an upper limit on input length. Concretely, FlexMem first treats the visual KV caches as the memory source and realizes effective memory transfer and writing via a dual-pathway compression design. FlexMem then explores different memory reading strategies for diverse video understanding tasks, including the popular streaming one. To validate FlexMem, we apply it to two popular video-MLLMs and conduct extensive experiments on five long video tasks and one streaming video task. The experimental results show that, on a single 3090 GPU, FlexMem achieves clear improvements over existing efficient video understanding methods and processes more than 1k frames. It also helps the base MLLMs achieve comparable or even better performance than SOTA MLLMs, e.g., GPT-4o and Gemini-1.5 Pro, on some benchmarks.
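
The write-then-recall idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how the dual-pathway compression or the reading strategies work, so the sketch substitutes simple stand-ins (keep the tokens with the largest key norm when writing, rank chunks by dot product with a query vector when reading). All names here, such as `ToyVisualMemory`, `keep_per_chunk`, and `top_k`, are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def mean(vectors: List[Vector]) -> Vector:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


@dataclass
class MemoryEntry:
    chunk_id: int
    keys: List[Vector]
    values: List[Vector]
    summary: Vector  # mean of the kept keys, used as the recall index


@dataclass
class ToyVisualMemory:
    keep_per_chunk: int
    entries: List[MemoryEntry] = field(default_factory=list)

    def write(self, keys: List[Vector], values: List[Vector]) -> None:
        # Assumed stand-in for compression: keep the tokens whose keys
        # have the largest squared norm, then restore temporal order.
        order = sorted(range(len(keys)), key=lambda i: dot(keys[i], keys[i]), reverse=True)
        kept = sorted(order[: self.keep_per_chunk])
        kept_keys = [keys[i] for i in kept]
        kept_values = [values[i] for i in kept]
        entry = MemoryEntry(len(self.entries), kept_keys, kept_values, mean(kept_keys))
        self.entries.append(entry)

    def read(self, query: Vector, top_k: int) -> List[MemoryEntry]:
        # Assumed stand-in for memory reading: recall the top_k chunks most
        # similar to the query, returned in their original temporal order.
        ranked = sorted(self.entries, key=lambda e: dot(query, e.summary), reverse=True)
        return sorted(ranked[:top_k], key=lambda e: e.chunk_id)


# Usage: write two video chunks one after another, then recall for a question.
memory = ToyVisualMemory(keep_per_chunk=2)
memory.write(keys=[[1.0, 0.0], [0.1, 0.0], [2.0, 0.0]], values=[[1.0], [2.0], [3.0]])
memory.write(keys=[[0.0, 1.0], [0.0, 3.0], [0.0, 0.2]], values=[[4.0], [5.0], [6.0]])
recalled = memory.read(query=[0.0, 1.0], top_k=1)
print([entry.chunk_id for entry in recalled])  # prints [1]
```

Because each chunk is compressed as it is written, the stored memory grows by at most `keep_per_chunk` tokens per chunk, and only the recalled entries would need to be handed back to the model at question time. That is the property the sketch is meant to convey, under the assumptions stated above.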

Architecture Design (Transformers, SSMs, MoE) · Computer Vision · Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
