Jun 16, 2026 · arXiv:2606.18388

LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

Haoyang Fang, Wei Zhu, Boran Han, Alex Zhang, Zhenyu Pan, Shuo Yang, Shuai Zhang, Jiading Gai, Peng Tang, Cuixiong Hu, Xuan Zhu, Huzefa Rangwala, George Karypis, Bernie Wang

AI Summary

This paper introduces LLMZero, a system in which LLM agents use tree search to discover reinforcement learning (RL) post-training strategies. Its central finding is that capacity parameters accumulate monotonically across training stages, while regularization parameters mostly oscillate in response to training dynamics. By diagnosing pathologies at each checkpoint and proposing coordinated multi-parameter transitions, LLMZero improves over the base model by 9% to 140% (relative) and over grid search by 6% to 15% (relative) across 4 GRPO tasks, and it consistently outperforms random search and a skill-based agent. The results indicate that multi-stage training benefits from adaptive strategies that can track non-stationary exploration-exploitation tradeoffs, and they yield actionable design rules for such training.
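The monotonic-versus-oscillating distinction can be made concrete with a small check on a parameter's per-stage values. The following is a minimal sketch, not the paper's method: the function name `classify_trajectory`, the tolerance argument, and the example values are all hypothetical and chosen only for illustration.

```python
def classify_trajectory(values, tol=0.0):
    """Label a per-stage parameter sequence (hypothetical helper).

    Returns 'constant' if no stage-to-stage change exceeds tol,
    'monotonic' if all such changes share one sign,
    and 'oscillating' if both increases and decreases occur.
    """
    deltas = [b - a for a, b in zip(values, values[1:])]
    signs = {1 if d > 0 else -1 for d in deltas if abs(d) > tol}
    if not signs:
        return "constant"
    return "monotonic" if len(signs) == 1 else "oscillating"


# Invented example values; they are NOT taken from the paper.
capacity_param = [8, 16, 16, 32]                 # only ever increases
regularization_param = [0.04, 0.01, 0.05, 0.02]  # moves up and down

print(classify_trajectory(capacity_param))
print(classify_trajectory(regularization_param))
```

Under these assumptions, a fixed schedule corresponds to committing to one such sequence in advance, whereas an adaptive strategy chooses each next value after inspecting the latest checkpoint.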

Key Contribution

LLMZero finds that adaptive training strategies, which adjust regularization parameters in response to training dynamics, can improve RL post-training performance by up to 140% relative to the base model.

Abstract

RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to shifting training dynamics. This distinction matters because fixed schedules commit all parameters to fixed trajectories and therefore cannot express the non-stationary exploration-exploitation tradeoffs that regularization must track; the principle provides actionable design rules for multi-stage training. We discover this through LLMZero, a system where LLM agents search over training trajectories via tree search, diagnosing pathologies at each checkpoint and proposing coordinated multi-parameter transitions. Across 4 diverse GRPO tasks, LLMZero discovers strategies that improve over the base model by 9% to 140% relative and over grid search by 6% to 15% relative, consistently outperforming random search and the skill-based agent. The structural principle transfers across tasks, providing an explanation for why discovered strategies take qualitatively different forms yet share similar parameter dynamics.
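The search described in the abstract can be pictured as a tree whose nodes are checkpoints and whose edges are multi-parameter transitions between training stages. The sketch below is a toy, beam-style tree search under stated assumptions: the abstract does not specify the actual search procedure, so `propose_transitions` is a random stand-in for the LLM agent, `evaluate` is an invented objective that replaces real training and evaluation, and the parameter names `capacity` and `regularization` are placeholders.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Node:
    """One checkpoint in the search tree (hypothetical structure)."""
    config: dict
    score: float
    depth: int = 0
    parent: "Node | None" = None
    children: list = field(default_factory=list)


def evaluate(config, depth):
    """Toy stand-in for training one stage and scoring the checkpoint.

    The best regularization value shifts with depth, to mimic
    non-stationary training dynamics. Invented for illustration.
    """
    target = 0.1 if depth % 2 == 0 else 0.2
    return 0.1 * config["capacity"] - (config["regularization"] - target) ** 2


def propose_transitions(node, rng, k=3):
    """Random stand-in for the LLM agent: k coordinated two-parameter changes."""
    proposals = []
    for _ in range(k):
        cfg = dict(node.config)
        cfg["capacity"] = max(1, cfg["capacity"] + rng.choice([-1, 0, 1]))
        cfg["regularization"] = max(0.0, cfg["regularization"] + rng.uniform(-0.1, 0.1))
        proposals.append(cfg)
    return proposals


def tree_search(root_config, stages=3, branching=3, beam=2, seed=0):
    """Expand the tree stage by stage, keeping the `beam` best checkpoints."""
    rng = random.Random(seed)
    root = Node(dict(root_config), evaluate(root_config, 0))
    frontier, best = [root], root
    for depth in range(1, stages + 1):
        candidates = []
        for node in frontier:
            for cfg in propose_transitions(node, rng, branching):
                child = Node(cfg, evaluate(cfg, depth), depth, node)
                node.children.append(child)
                candidates.append(child)
        candidates.sort(key=lambda n: n.score, reverse=True)
        frontier = candidates[:beam]
        if frontier[0].score > best.score:
            best = frontier[0]
    return best


def path_to(node):
    """Recover the stage-by-stage configs from the root to `node`."""
    path = []
    while node is not None:
        path.append(node.config)
        node = node.parent
    return path[::-1]


best = tree_search({"capacity": 1, "regularization": 0.1})
for stage, cfg in enumerate(path_to(best)):
    print(stage, cfg)
```

In a real system the proposer would condition on checkpoint diagnostics rather than sample at random, and each `evaluate` call would be an expensive training run, which is why the number of expanded nodes per stage matters.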

RLHF & Preference Learning · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
