Apr 2, 2026 · arXiv:2604.02268

SKILL0: In-Context Agentic Reinforcement Learning for Skill Internalization

Zhengxi Lu, Zhiyuan Yao, Jinyang Wu, Chengcheng Han, Qi Gu, Xunliang Cai, Jun Xiao, Yueting Zhuang, Yongliang Shen

AI Summary

SKILL0 is an in-context reinforcement learning framework that internalizes agent skills into model parameters, eliminating the need for inference-time skill retrieval. Training follows a curriculum that starts with full skill context and gradually removes it; a dynamic curriculum uses on-policy evaluation to retain only the skills that still help the current policy. Experiments on ALFWorld and Search-QA show that SKILL0 improves over the standard RL baseline (+9.7% and +6.6%, respectively) while keeping the context under 0.5k tokens per step.

Key Contribution

LLM agents can internalize skills via in-context RL, achieving zero-shot autonomous behavior without the token overhead and retrieval noise of inference-time skill augmentation.

Abstract

Agent skills, structured packages of procedural knowledge and executable resources that agents dynamically load at inference time, have become a reliable mechanism for augmenting LLM agents. Yet inference-time skill augmentation is fundamentally limited: retrieval noise introduces irrelevant guidance, injected skill content imposes substantial token overhead, and the model never truly acquires the knowledge it merely follows. We ask whether skills can instead be internalized into model parameters, enabling zero-shot autonomous behavior without any runtime skill retrieval. We introduce SKILL0, an in-context reinforcement learning framework designed for skill internalization. SKILL0 introduces a training-time curriculum that begins with full skill context and progressively withdraws it. Skills are grouped offline by category and rendered with interaction history into a compact visual context, teaching the model tool invocation and multi-turn task completion. A Dynamic Curriculum then evaluates each skill file's on-policy helpfulness, retaining only those from which the current policy still benefits within a linearly decaying budget, until the agent operates in a fully zero-shot setting. Extensive agentic experiments demonstrate that SKILL0 achieves substantial improvements over the standard RL baseline (+9.7% for ALFWorld and +6.6% for Search-QA), while maintaining a highly efficient context of fewer than 0.5k tokens per step. Our code is available at https://github.com/ZJU-REAL/SkillZero.
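
The Dynamic Curriculum step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `linear_budget` and `select_skills`, the definition of helpfulness as a per-skill gain (for example, success rate with the skill in context minus success rate without it), and the rule of dropping any skill with non-positive gain are all assumptions made here for illustration.

```python
from typing import Dict, List


def linear_budget(step: int, total_steps: int, initial_budget: int) -> int:
    """Hypothetical linearly decaying skill budget.

    Returns initial_budget at step 0 and 0 at total_steps, so the agent
    ends training with no skill files in context (zero-shot setting).
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    # Clamp so that steps outside [0, total_steps] stay within the schedule.
    step = min(max(step, 0), total_steps)
    # Integer arithmetic avoids floating-point rounding surprises.
    return initial_budget * (total_steps - step) // total_steps


def select_skills(gains: Dict[str, float], budget: int) -> List[str]:
    """Keep at most `budget` skills that the current policy still benefits from.

    `gains` maps a skill file name to its assumed on-policy helpfulness.
    Skills with non-positive gain are dropped regardless of budget.
    """
    helpful = [name for name, gain in gains.items() if gain > 0]
    # Highest gain first; ties broken by name for deterministic output.
    helpful.sort(key=lambda name: (-gains[name], name))
    return helpful[:max(budget, 0)]
```

In a training loop, one would recompute the gains from on-policy rollouts at each curriculum stage, call `linear_budget` for the current step, and pass the result to `select_skills` to decide which skill files stay in the context.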

RLHF & Preference Learning · Tool Use & Agents · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 66

Year: 2026

Venue: N/A
