Mar 2, 2026 · arXiv:2603.01990

According to Me: Long-Term Personalized Referential Memory QA

Jingbiao Mei, Jinghong Chen, Guangyu Yang, Xinyu Hou, Margaret Li, Bill Byrne

AI Summary

The authors introduce ATM-Bench, a new benchmark for multimodal, multi-source personalized referential Memory QA designed to evaluate long-term memory recall in AI assistants. This benchmark uses approximately four years of privacy-preserving personal memory data with human-annotated question-answer pairs that require resolving personal references and multi-evidence reasoning. They also propose Schema-Guided Memory (SGM) to structurally represent memory items from different sources, demonstrating that it improves performance compared to descriptive memory baselines, although overall accuracy remains low (under 20%) on the most challenging subset.

Key Contribution

Current long-term memory systems struggle to recall and reason over realistic, multimodal personal data, achieving under 20% accuracy on the ATM-Bench-Hard set of the new benchmark.

Abstract

Personalized AI assistants must recall and reason over long-term user memory, which naturally spans multiple modalities and sources such as images, videos, and emails. However, existing long-term memory benchmarks focus primarily on dialogue history and fail to capture realistic personalized references grounded in lived experience. We introduce ATM-Bench, the first benchmark for multimodal, multi-source personalized referential Memory QA. ATM-Bench contains approximately four years of privacy-preserving personal memory data and human-annotated question-answer pairs with ground-truth memory evidence, including queries that require resolving personal references, reasoning over multiple pieces of evidence from multiple sources, and handling conflicting evidence. We propose Schema-Guided Memory (SGM) to structurally represent memory items originating from different sources. In experiments, we implement 5 state-of-the-art memory systems along with a standard RAG baseline and evaluate variants with different memory ingestion, retrieval, and answer generation techniques. We find poor performance (under 20% accuracy) on the ATM-Bench-Hard set, and that SGM improves performance over the Descriptive Memory commonly adopted in prior work. Code is available at: https://github.com/JingbiaoMei/ATM-Bench

Eval Frameworks & Benchmarks · Multimodal Models · Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 35

Year: 2026

Venue: N/A
