The paper introduces summarization-based context management for RL fine-tuning of LLMs in long-horizon, multi-turn tool use, addressing the context length bottleneck. The authors formulate a policy gradient representation that allows standard LLM RL infrastructures to optimize both tool-use behaviors and summarization strategies end-to-end. The proposed algorithm, \texttt{SUPO}, improves success rates while maintaining the same or lower context length on interactive function-calling and search tasks, and even scales beyond the training-time number of summarization rounds at test time.
Forget context window limits: this RL method uses LLM-generated summaries to train agents for long-horizon tasks, achieving higher success rates with less context.
We study reinforcement learning (RL) fine-tuning of large language model (LLM) agents for long-horizon, multi-turn tool use, where context length quickly becomes a fundamental bottleneck. Existing RL pipelines can suffer from degraded instruction following, excessive rollout costs, and, most importantly, strict context limits. To address these challenges, we introduce summarization-based context management into training. Specifically, the agent's tool-use history is periodically compressed into LLM-generated summaries that retain task-relevant information, keeping the working context compact while enabling the agent to scale beyond the fixed context window. Building on this formulation, we derive a policy gradient representation that allows standard LLM RL infrastructures to optimize both tool-use behaviors and summarization strategies in an end-to-end fashion. We instantiate this framework with \underline{SU}mmarization augmented \underline{P}olicy \underline{O}ptimization (\texttt{SUPO}), an LLM RL algorithm that enables long-horizon training beyond a fixed context limit. Experiments on interactive function-calling and search tasks demonstrate that \texttt{SUPO} significantly improves the success rate while maintaining the same or even lower working context length than baselines. We also show that, on complex search tasks, \texttt{SUPO} further improves evaluation performance when the maximum number of summarization rounds at test time is scaled beyond that of training. Our results establish summarization-based context management as a principled and scalable approach for training RL agents beyond a fixed context length limit.
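To make the context-management mechanism concrete, here is a minimal Python sketch of one rollout with periodic summarization. All names here (`run_rollout`, the injected callables, the `FINAL_ANSWER` stop convention, the budget threshold) are our own illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]

def run_rollout(
    policy: Callable[[List[Message]], str],        # agent LLM: messages -> action text
    summarize: Callable[[List[Message]], str],     # summarizer (may be the same LLM)
    execute_tool: Callable[[str], str],            # environment: tool call -> observation
    count_tokens: Callable[[List[Message]], int],  # tokenizer-based length estimate
    task_prompt: str,
    max_turns: int = 20,
    context_budget: int = 8192,
) -> List[Message]:
    """One rollout with summarization-based context management (sketch)."""
    context: List[Message] = [{"role": "user", "content": task_prompt}]
    for _ in range(max_turns):
        action = policy(context)
        context.append({"role": "assistant", "content": action})
        if "FINAL_ANSWER" in action:  # hypothetical stop convention
            break
        context.append({"role": "tool", "content": execute_tool(action)})
        # Periodic compression: once the working context nears the budget,
        # replace the raw tool-use history with an LLM-generated summary that
        # retains task-relevant information, so the agent can keep acting
        # beyond the fixed context window.
        if count_tokens(context) > context_budget:
            summary = summarize(context)
            context = [
                {"role": "user", "content": task_prompt},
                {"role": "assistant", "content": f"[Summary of progress] {summary}"},
            ]
    return context
```

Because the summary text is itself generated by the policy model, summarization is just another action the agent takes, which is what lets a standard RL pipeline train it alongside tool use.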
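One way to picture the policy gradient representation, under our own simplifying assumptions rather than the paper's exact derivation, is as ordinary token-level REINFORCE terms summed over the summary-partitioned segments of a trajectory:

$$
\nabla_\theta J(\theta) \;\approx\; \mathbb{E}\!\left[\, \sum_{k=1}^{K} \sum_{t \in \mathcal{T}_k} \nabla_\theta \log \pi_\theta\!\big(a_t \mid c_k, a_{<t}\big)\, \hat{A} \,\right],
$$

where summarization splits the trajectory into $K$ segments, $\mathcal{T}_k$ indexes the tokens of segment $k$, $c_k$ is that segment's working context (the task prompt plus the latest summary), and $\hat{A}$ is an advantage estimate. Since summary tokens are also sampled from $\pi_\theta$, both tool-use and summarization behaviors receive gradient signal under this form.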