JD Explore Academy, PKU, UW-Madison, Yuanpei College
Jun 3, 2026
arXiv:2606.05008

M^3Eval: Multi-Modal Memory Evaluation through Cognitively-Grounded Video Tasks

Jie Huang, Ruixun Liu, Sirui Sun, Xinyi Yang, Yiwu Zhong

AI Summary

This paper introduces M^3Eval, an evaluation framework and benchmark for systematically assessing memory in multi-modal models, particularly for long-form video understanding. Grounding the evaluation in cognitive psychology, the authors construct tasks that isolate individual memory dimensions. Their experiments show that current models struggle to keep representations of parallel video streams disentangled and exhibit interference patterns that differ from those observed in human memory. The findings highlight memory as a critical yet underexplored capability of multi-modal models and point to areas where memory mechanisms can improve.

Key Contribution

Multi-modal models struggle to keep their memories of parallel video streams disentangled, which points to weaknesses in their memory mechanisms.

Abstract

As multi-modal models advance towards long-form video understanding, memory emerges as a critical capability. Despite substantial efforts in developing video datasets and benchmarks, existing works primarily focus on perception and reasoning, without systematically evaluating memory: what models retain, how faithfully information is preserved, and how robust memory remains under interference. To address this gap, we introduce M^3Eval, the first comprehensive evaluation framework and benchmark for probing different memory dimensions in multi-modal models. Grounded in cognitive psychology, our design features carefully constructed tasks that isolate key aspects of memory. Leveraging M^3Eval, we conduct extensive experiments across representative multi-modal models, revealing consistent weaknesses and distinctive behaviors. We find that models struggle to maintain disentangled representations when processing parallel video streams, exhibit interference patterns differing substantially from those observed in human memory, ground memory sources more reliably in the spatial domain than the temporal domain, and demonstrate limited symbolic memory. Collectively, our benchmark provides a valuable resource for future research, while our findings highlight memory as a fundamental yet underexplored capability and offer insights for designing more effective memory mechanisms in multi-modal models. Our code and dataset are available at https://pku-value-lab.github.io/m3eval-homepage.

Eval Frameworks & Benchmarks, Multimodal Models

