Central South University, HKUST, ZJU | Jun 5, 2026 | arXiv:2606.07512

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

Cong Chen, Guo Gan, Kaixiang Ji, ChaoYang Zhang, Chao Zhang, Zhenyu Yang, Guangming Yao, Hao Chen, Jingdong Chen, Yingqing Yuan, Yi Yuan, Chunhua Shen

AI Summary

This paper introduces MemDreamer, a novel framework designed to enhance long video understanding by decoupling perception and reasoning through a Hierarchical Graph Memory architecture. By employing an agentic retrieval mechanism, MemDreamer efficiently processes lengthy visual sequences, achieving state-of-the-art performance across four benchmarks while significantly reducing the reasoning context window to just 2% of the full context. The results indicate a mere 3.7-point gap from human expert performance, alongside a notable 12.5-point absolute accuracy improvement, highlighting the effectiveness of agentic capabilities in multimodal comprehension.

Key Contribution

MemDreamer narrows the performance gap with human experts in long video understanding to just 3.7 points while processing only 2% of the full context.

Abstract

Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5-point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logic reasoning and long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.
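To make the retrieval idea in the abstract concrete, here is a minimal Python sketch of a tiered graph memory queried through an Observation-Reason-Action loop. It is an illustration under stated assumptions, not the authors' implementation: the class and function names (`Node`, `HierarchicalGraphMemory`, `run_agent`, `scripted_policy`), the tier numbering, the `leads_to` relation, and the toy cooking-video content are all hypothetical, and a hard-coded policy stands in for the reasoning model.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    tier: int   # assumed numbering: 0 = video summary, 1 = segment, 2 = foundational graph node
    text: str
    children: list = field(default_factory=list)  # ids one tier below
    edges: dict = field(default_factory=dict)     # relation name -> list of target ids


class HierarchicalGraphMemory:
    """Toy three-tier memory with typed edges between nodes (hypothetical API)."""

    def __init__(self):
        self.nodes = {}

    def add(self, node, parent=None):
        self.nodes[node.node_id] = node
        if parent is not None:
            self.nodes[parent].children.append(node.node_id)

    def link(self, src, relation, dst):
        self.nodes[src].edges.setdefault(relation, []).append(dst)

    # The three retrieval tools named in the abstract, in simplified form.
    def expand(self, node_id):             # navigate the hierarchy
        return list(self.nodes[node_id].children)

    def search(self, keyword):             # search nodes by keyword match
        return [n.node_id for n in self.nodes.values()
                if keyword.lower() in n.text.lower()]

    def traverse(self, node_id, relation): # follow a logical edge
        return list(self.nodes[node_id].edges.get(relation, []))


def run_agent(memory, policy, question, max_steps=8):
    """Observation-Reason-Action loop: the policy reasons over the latest
    observation, picks a tool call, and the result becomes the next observation."""
    observation, trace = [], []
    for _ in range(max_steps):
        action, arg = policy(question, observation, trace)  # Reason
        if action == "answer":
            return arg, trace
        if action == "expand":                              # Action
            ids = memory.expand(arg)
        elif action == "search":
            ids = memory.search(arg)
        elif action == "traverse":
            ids = memory.traverse(*arg)
        else:
            raise ValueError(f"unknown action: {action}")
        observation = [(i, memory.nodes[i].text) for i in ids]  # Observation
        trace.append((action, arg, observation))
    return None, trace


def scripted_policy(question, observation, trace):
    """Stand-in for the reasoning model: search, follow one edge, then answer."""
    if not trace:
        return "search", "chops"
    if len(trace) == 1:
        first_hit = trace[0][2][0][0]
        return "traverse", (first_hit, "leads_to")
    return "answer", trace[-1][2][0][1]


# Toy memory for a short cooking video (invented content).
memory = HierarchicalGraphMemory()
memory.add(Node("video", 0, "A cooking video"))
memory.add(Node("seg1", 1, "Chef prepares ingredients"), parent="video")
memory.add(Node("seg2", 1, "Chef cooks the dish"), parent="video")
memory.add(Node("e1", 2, "Chef chops onions"), parent="seg1")
memory.add(Node("e2", 2, "Onions are fried in the pan"), parent="seg2")
memory.link("e1", "leads_to", "e2")

answer, trace = run_agent(memory, scripted_policy,
                          "What happens after the chef chops onions?")
print(answer)
```

In this sketch the policy only ever sees the few nodes returned by each tool call rather than the whole memory, which is the intuition behind keeping the reasoning context small. A real system would replace `scripted_policy` with a model that chooses tool calls from the question and the observations.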

Computer Vision, Multimodal Models, Tool Use & Agents


Year: 2026

Venue: N/A
