Tencent AI | USTC | May 28, 2026 | arXiv:2605.30159

Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents

Ziyan Liu, Zhezheng Hao, Yeqiu Chen, Hong Wang, Jingren Hou, Ruiying Ding, Yongkang Yang, Wence Ji, Weibing Xia, Feng Liu

AI Summary

This paper introduces Metacognitive Memory Policy Optimization (MMPO), an approach for training memory-augmented LLM agents that addresses the degradation of intermediate memory quality during long-horizon tasks. Using Belief Entropy as a self-supervised proxy, MMPO penalizes memory summaries that induce high epistemic uncertainty, which sharpens the agent's belief about the latent task state. The reported experiments show MMPO consistently outperforming existing methods on diverse long-horizon tasks and maintaining 97.1% performance when scaled to 1.75M-token contexts.

Key Contribution

Memory optimization that targets the clarity of the belief induced by intermediate summaries, rather than trajectory-level success alone, improves long-horizon reasoning in LLM agents.

Abstract

Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing interaction trajectories into compact memory. However, existing approaches typically train these memory policies using outcome-based reinforcement learning, failing to localize where intermediate memory quality degrades. As interactions unfold, ambiguous recursive summaries progressively discard task-relevant information and introduce semantic noise. This exacerbates belief deviation, obscuring the agent's estimate of the latent task state and ultimately derailing long-horizon reasoning. We therefore argue that memory optimization should focus not merely on trajectory-level success, but on the clarity of the belief induced by intermediate summaries. To this end, we introduce Belief Entropy, a self-supervised proxy that probes how uncertain the model remains about the latent task state given its current memory. Based on this proxy, we propose Metacognitive Memory Policy Optimization (MMPO). Instead of relying only on sparse outcome-based signals, MMPO provides fine-grained, memory-specific supervision via explicitly penalizing summaries that induce high epistemic uncertainty. Experiments show that MMPO consistently outperforms existing methods on diverse long-horizon tasks, maintaining 97.1% performance even when scaled to 1.75M-token contexts.
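The abstract does not give the formula for Belief Entropy or the exact form of the MMPO penalty, so the following is only a minimal sketch of the general idea under stated assumptions. It assumes the belief is a discrete probability distribution over candidate latent task states obtained by probing the model after each summary, that uncertainty is measured as Shannon entropy, and that each summary's reward is the shared outcome reward minus a weighted entropy term. The names `belief_entropy`, `shaped_rewards`, and `penalty_weight` are hypothetical and do not come from the paper.

```python
import math
from typing import List, Sequence


def belief_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a belief over candidate task states.

    Assumption: the belief is a discrete distribution; the paper's actual
    probing procedure is not described in the abstract.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("belief must have positive total mass")
    normalized = [p / total for p in probs]
    # Zero-probability states contribute nothing to the entropy.
    return -sum(p * math.log(p) for p in normalized if p > 0)


def shaped_rewards(
    outcome_reward: float,
    step_beliefs: Sequence[Sequence[float]],
    penalty_weight: float = 0.1,  # hypothetical hyperparameter
) -> List[float]:
    """Per-summary reward: outcome reward minus an entropy penalty.

    Assumption: a simple linear penalty; the paper's objective may differ.
    """
    return [
        outcome_reward - penalty_weight * belief_entropy(belief)
        for belief in step_beliefs
    ]


if __name__ == "__main__":
    # Two memory summaries: the first leaves a sharp belief, the second an
    # ambiguous one, so the second receives the lower shaped reward.
    beliefs = [[0.97, 0.01, 0.01, 0.01], [0.25, 0.25, 0.25, 0.25]]
    print(shaped_rewards(1.0, beliefs))
```

In this sketch, a summary that leaves the belief spread evenly across states is penalized more than one that concentrates it, which gives each intermediate summary its own training signal even though the outcome reward is shared across the trajectory.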

Reasoning & Chain-of-Thought | Scalable Oversight & Alignment Theory | Tool Use & Agents

Citation Metrics

Citations: 1

Influential citations: 0

References: 38

Year: 2026

Venue: N/A
