Feb 26, 2026

arXiv:2602.22769

AMA-Bench: Evaluating Long-Horizon Memory for Agentic Applications

Yujie Zhao, Boqin Yuan, Bo Yuan, Junbo Huang, Haochen Yuan, Haocheng Yuan, Zhongming Yu, Haozhou Xu, Lanxiang Hu, Abhilash Shankarampeta, Zimeng Huang, Wentao Ni, Yuandong Tian, Jishen Zhao

AI Summary

The paper introduces AMA-Bench, a benchmark for evaluating long-horizon memory in LLM-based agents. It addresses the gap between dialogue-centric benchmarks and real-world agentic applications, where memory is a continuous stream of agent-environment interactions. AMA-Bench comprises real-world agentic trajectories paired with expert-curated QA and synthetic trajectories paired with rule-based QA, the latter scaling to arbitrary horizons. The authors find that existing memory systems underperform because they lack causality and objective information and rely on lossy similarity-based retrieval. They propose AMA-Agent, a memory system built on a causality graph and tool-augmented retrieval, which achieves 57.22% average accuracy on AMA-Bench, surpassing the strongest memory system baselines by 11.16%.

Key Contribution

Current LLM memory systems falter when faced with the continuous, machine-generated interaction streams typical of real-world agentic applications, highlighting a critical need for causality-aware and tool-augmented memory architectures.

Abstract

Large Language Models (LLMs) are deployed as autonomous agents in increasingly complex applications, where enabling long-horizon memory is critical for achieving strong performance. However, a significant gap exists between practical applications and current evaluation standards for agent memory: existing benchmarks primarily focus on dialogue-centric, human-agent interactions. In reality, agent memory consists of a continuous stream of agent-environment interactions that are primarily composed of machine-generated representations. To bridge this gap, we introduce AMA-Bench (Agent Memory with Any length), which evaluates long-horizon memory for LLMs in real agentic applications. It features two key components: (1) a set of real-world agentic trajectories across representative agentic applications, paired with expert-curated QA, and (2) a set of synthetic agentic trajectories that scale to arbitrary horizons, paired with rule-based QA. Our comprehensive study shows that existing memory systems underperform on AMA-Bench primarily because they lack causality and objective information and are constrained by the lossy nature of similarity-based retrieval employed by many memory systems. To address these limitations, we propose AMA-Agent, an effective memory system featuring a causality graph and tool-augmented retrieval. Our results demonstrate that AMA-Agent achieves 57.22% average accuracy on AMA-Bench, surpassing the strongest memory system baselines by 11.16%.
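The second component, synthetic trajectories of arbitrary horizon paired with rule-based QA, can be illustrated with a small sketch. This is not the AMA-Bench generator: the toy environment (items moved between rooms), the `Step` record, and the names `make_trajectory` and `rule_based_qa` are all hypothetical, chosen only to show how ground-truth answers can be derived mechanically from a trajectory of any length.

```python
from __future__ import annotations

import random
from dataclasses import dataclass


@dataclass
class Step:
    """One agent-environment interaction (hypothetical record format)."""
    t: int            # step index within the trajectory
    action: str       # machine-generated action string, e.g. "move key lab"
    observation: str  # environment feedback for that action


def make_trajectory(horizon: int, seed: int = 0) -> list[Step]:
    """Generate a toy trajectory of `horizon` steps (assumed environment)."""
    rng = random.Random(seed)
    items = ["key", "map", "lamp"]
    rooms = ["hall", "lab", "vault"]
    steps = []
    for t in range(horizon):
        item, room = rng.choice(items), rng.choice(rooms)
        steps.append(Step(t, f"move {item} {room}", f"{item} is now in {room}"))
    return steps


def rule_based_qa(steps: list[Step]) -> list[tuple[str, str]]:
    """Derive QA pairs by rule: the answer is each item's last recorded room."""
    last_room: dict[str, str] = {}
    for step in steps:
        _, item, room = step.action.split()
        last_room[item] = room  # later steps overwrite earlier ones
    return [
        (f"Where is the {item} at the end?", room)
        for item, room in sorted(last_room.items())
    ]


if __name__ == "__main__":
    trajectory = make_trajectory(horizon=200, seed=1)
    for question, answer in rule_based_qa(trajectory):
        print(question, "->", answer)
```

Because the answers are computed from the trajectory itself, the horizon can grow without any manual annotation; a memory system is then scored on whether it recovers the same answers from its stored representation.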

Eval Frameworks & Benchmarks, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 46

Year: 2026

Venue: N/A
