This paper introduces MemDreamer, a novel framework that addresses the challenges of long-video understanding by decoupling perception and reasoning through a Hierarchical Graph Memory and an agentic retrieval mechanism. By incrementally streaming video data and employing a structured top-down architecture for semantic abstraction, MemDreamer significantly reduces the reasoning context window to just 2% of the full context while achieving state-of-the-art results across four benchmarks. The findings reveal a strong correlation between performance in logical reasoning and long-video understanding, suggesting that agentic capabilities can enhance multimodal comprehension in Vision-Language Models.
MemDreamer narrows the performance gap with human experts to just 3.7 points while slashing the reasoning context window to a mere 2% of full video ingestion.
Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences causes prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer, which decouples perception from reasoning and turns long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5-point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logical reasoning benchmarks and its performance on long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.
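To make the retrieval idea concrete, the following is a minimal sketch of a tiered graph memory queried through an Observation-Reason-Action loop. It is not the authors' implementation. The class and function names (`Node`, `HierarchicalGraphMemory`, `run_agent`, `scripted_policy`), the tier numbering, the keyword-matching search, and the toy cooking video are all assumptions made for illustration. A scripted policy stands in for the reasoning model so the example runs without a VLM.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Node:
    """One memory entry. Tier 0 is the most abstract level (hypothetical numbering)."""
    node_id: str
    tier: int
    summary: str
    children: list = field(default_factory=list)  # ids of nodes one tier below
    edges: list = field(default_factory=list)     # (relation, target_id) pairs


class HierarchicalGraphMemory:
    """Toy three-tier memory: video -> scene -> clip, with typed edges between nodes."""

    def __init__(self) -> None:
        self.nodes: dict = {}

    def add(self, node: Node, parent: Optional[str] = None) -> None:
        # Called incrementally as new video segments are summarized.
        self.nodes[node.node_id] = node
        if parent is not None:
            self.nodes[parent].children.append(node.node_id)

    def link(self, src: str, relation: str, dst: str) -> None:
        self.nodes[src].edges.append((relation, dst))

    # The three methods below are the retrieval tools exposed to the agent.
    def expand(self, node_id: str) -> list:
        """Navigate the hierarchy: return the children of a node."""
        return [self.nodes[c] for c in self.nodes[node_id].children]

    def search(self, keyword: str) -> list:
        """Search nodes: case-insensitive keyword match (an assumption, not the paper's method)."""
        return [n for n in self.nodes.values() if keyword.lower() in n.summary.lower()]

    def traverse(self, node_id: str, relation: str) -> list:
        """Traverse logical edges: follow edges of one relation type."""
        return [self.nodes[dst] for rel, dst in self.nodes[node_id].edges if rel == relation]


def run_agent(memory: HierarchicalGraphMemory, question: str,
              policy: Callable, max_steps: int = 8) -> Optional[str]:
    """Observation-Reason-Action loop; `policy` stands in for the reasoning model."""
    tools = {"expand": memory.expand, "search": memory.search, "traverse": memory.traverse}
    observations: list = []
    for _ in range(max_steps):
        action, *args = policy(question, observations)  # Reason: choose the next action
        if action == "answer":
            return args[0]
        results = tools[action](*args)                  # Action: call a memory tool
        # Observation: only short node summaries enter the reasoning context.
        observations.append("; ".join(f"{n.node_id}: {n.summary}" for n in results))
    return None


# Build a tiny memory (hypothetical content).
memory = HierarchicalGraphMemory()
memory.add(Node("video", 0, "Cooking video"))
memory.add(Node("scene_1", 1, "Preparing ingredients"), parent="video")
memory.add(Node("clip_1", 2, "Chef chops onions"), parent="scene_1")
memory.add(Node("clip_2", 2, "Chef wipes eyes"), parent="scene_1")
memory.link("clip_1", "causes", "clip_2")


def scripted_policy(question: str, observations: list) -> tuple:
    """Fixed action sequence used in place of a real reasoning model."""
    if not observations:
        return ("search", "onions")
    if len(observations) == 1:
        return ("traverse", "clip_1", "causes")
    return ("answer", observations[-1])


answer = run_agent(memory, "What did chopping the onions lead to?", scripted_policy)
print(answer)
```

In this sketch the reasoning context holds only the summaries returned by each tool call, never the raw frames, which is the general mechanism by which a design of this kind could keep the context window small.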