ByteDance · Apr 9, 2026 · arXiv:2604.07791

SEARL: Joint Optimization of Policy and Tool Graph Memory for Self-Evolving Agents

Xinshun Feng, Xin Song, Xinhao Song, Lijun Li, Gongshen Liu, Jing Shao

AI Summary

The paper introduces SEARL, a reinforcement learning framework for self-evolving agents that jointly optimizes a policy and a tool graph memory. SEARL constructs a structured experience memory that integrates planning with execution, enabling generalization across analogous contexts, such as tool reuse. Experiments on knowledge reasoning and mathematics tasks indicate that SEARL learns more practically and efficiently than methods that rely on large LLMs or multi-agent frameworks.

Key Contribution

Self-evolving agents can now learn more efficiently in resource-constrained environments by explicitly structuring experience into a tool graph memory that facilitates planning and tool reuse.

Abstract

Recent advances in Reinforcement Learning with Verifiable Rewards (RLVR) have demonstrated significant potential in single-turn reasoning tasks. With the paradigm shift toward self-evolving agentic learning, models are increasingly expected to learn from trajectories by synthesizing tools or accumulating explicit experiences. However, prevailing methods typically rely on large-scale LLMs or multi-agent frameworks, which hinders their deployment in resource-constrained environments. The inherent sparsity of outcome-based rewards also poses a substantial challenge, as agents typically receive feedback only upon completion of tasks. To address these limitations, we introduce SEARL, a Tool-Memory-based self-evolving agentic framework. Unlike approaches that directly utilize interaction experiences, our method constructs a structured experience memory that integrates planning with execution. This provides a novel state abstraction that facilitates generalization across analogous contexts, such as tool reuse. Consequently, agents extract explicit knowledge from historical data while leveraging inter-trajectory correlations to densify reward signals. We evaluate our framework on knowledge reasoning and mathematics tasks, demonstrating its effectiveness in achieving more practical and efficient learning.
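The abstract does not specify how the tool memory is stored or how rewards are densified, so the following is only a minimal illustrative sketch and not SEARL's actual method. It assumes a graph whose nodes are tool names and whose edges are consecutive tool calls, with each edge tracking how often trajectories that used it succeeded. The class `ToolGraphMemory`, its methods, the 0.5 prior for unseen edges, and the `weight` parameter are all hypothetical names and choices introduced here for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class ToolGraphMemory:
    """Hypothetical memory: nodes are tool names, edges are consecutive tool calls."""

    # edge -> [successful trajectories, total trajectories] that traversed it
    edge_stats: dict = field(default_factory=lambda: defaultdict(lambda: [0, 0]))

    def update(self, tools, success):
        """Record one finished trajectory (a list of tool names) and its outcome."""
        for edge in zip(["<start>"] + tools, tools):
            stats = self.edge_stats[edge]
            stats[0] += int(success)
            stats[1] += 1

    def edge_value(self, prev_tool, tool):
        """Empirical success rate of an edge; assumed 0.5 prior when unseen."""
        wins, total = self.edge_stats.get((prev_tool, tool), (0, 0))
        return wins / total if total else 0.5

    def dense_rewards(self, tools, outcome, weight=0.1):
        """Per-step rewards: a small edge-based bonus, plus the sparse outcome at the end."""
        rewards = [
            weight * (self.edge_value(prev_tool, tool) - 0.5)
            for prev_tool, tool in zip(["<start>"] + tools, tools)
        ]
        if rewards:
            rewards[-1] += outcome
        return rewards


# Usage: statistics from earlier trajectories shape the rewards of a new one.
memory = ToolGraphMemory()
memory.update(["search", "calculator"], success=True)
memory.update(["search", "calculator"], success=True)
memory.update(["search", "guess"], success=False)
rewards = memory.dense_rewards(["search", "calculator"], outcome=1.0)
```

In this sketch, every step receives a signal derived from other trajectories that shared the same tool transition, which is one simple way to read "leveraging inter-trajectory correlations to densify reward signals"; the paper's own formulation may differ substantially.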

Reasoning & Chain-of-Thought · Tool Use & Agents · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 30

Year: 2026

Venue: N/A
