Macao Polytechnic University, Meituan, UT Austin, Xiamen University

Jun 1, 2026

arXiv:2606.02355

SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training

Zhongyu He, Yuanfan Li, Fei Huang, Tianyu Chen, Siyuan Chen, Xingyang Li, Meng Hsuan Yu, Xiangrong Liu, Leyi Wei, Ke Zeng, Xunliang Cai

AI Summary

This paper introduces SIRI, a framework for training long-horizon LLM agents that discover and internalize their own skills without external skill generators or inference-time skill retrieval. The approach has three phases: warming up the policy with GiGPO, mining skills from the policy's own successful trajectories, and distilling only the beneficial skill-guided actions back into the policy. With Qwen2.5-7B-Instruct, SIRI improves GiGPO from 0.908 to 0.930 on ALFWorld and from 0.728 to 0.813 on WebShop, outperforming prompt-based, RL-based, and memory-augmented baselines while avoiding the engineering complexity of retrieval-based skill methods.

Key Contribution

SIRI lets LLM agents discover and internalize skills on their own, improving GiGPO by 0.022 on ALFWorld (0.908 to 0.930) and by 0.085 on WebShop (0.728 to 0.813) without external skill generators or inference-time skill banks.

Abstract

Long-horizon LLM agents can benefit from reusable skills, yet existing skill-based methods often rely on external skill generators during training or persistent skill retrieval at inference, increasing engineering complexity, context length, and deployment latency. We propose Self-Internalizing Reinforcement learning with Intrinsic skills (SIRI), a three-phase framework that enables agents to discover, validate, and internalize skills without external skill generators or inference-time skill banks. SIRI first warms up the policy with GiGPO to acquire basic interaction ability and collect successful skill-free trajectories. It then performs self-skill mining, where the current policy summarizes compact skills from its own successful plain rollouts and validates them through paired skill-augmented and skill-free rollouts. Finally, SIRI distills only beneficial skill-guided action tokens into the plain policy using trajectory-level utility and action-level advantage. At inference, the agent runs with the original prompt only. On ALFWorld and WebShop with Qwen2.5-7B-Instruct, SIRI improves GiGPO from 0.908 to 0.930 on ALFWorld and from 0.728 to 0.813 on WebShop, outperforming prompt-based, RL-based, and memory-augmented baselines. Further analysis shows that our self-mining strategy can achieve performance comparable to distillation with a closed-source large model. Our code is available at https://github.com/kirito618/SIRI.
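
The validation and selection steps described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not define the utility or advantage formulas, so the mean-return gap used for trajectory-level utility, the per-step `advantage` field, and the zero thresholds are all assumptions, and every name (`Step`, `skill_utility`, `select_distillation_targets`) is hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Step:
    """One action from a skill-augmented rollout (hypothetical structure)."""
    action: str
    advantage: float  # action-level advantage estimate, assumed precomputed


def skill_utility(with_skill: List[float], without_skill: List[float]) -> float:
    """Assumed trajectory-level utility: mean return gap between the
    skill-augmented rollouts and their paired skill-free rollouts."""
    return mean(with_skill) - mean(without_skill)


def select_distillation_targets(
    steps: List[Step],
    utility: float,
    min_utility: float = 0.0,    # assumed threshold
    min_advantage: float = 0.0,  # assumed threshold
) -> List[str]:
    """Keep an action only if the skill helped overall (utility gate)
    and the action itself has positive advantage (action gate)."""
    if utility <= min_utility:
        return []  # skill was not beneficial: distill nothing from it
    return [s.action for s in steps if s.advantage > min_advantage]


# Paired rollouts: task success (1.0) or failure (0.0) per episode.
utility = skill_utility([1.0, 1.0, 0.0], [1.0, 0.0, 0.0])
steps = [Step("go to fridge", 0.5), Step("look", -0.1), Step("take apple", 0.2)]
targets = select_distillation_targets(steps, utility)
```

In this toy case the skill raises the mean return, so the two positive-advantage actions are kept and the negative-advantage `look` step is dropped. A skill with non-positive utility would contribute no distillation targets at all.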

RLHF & Preference Learning, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
