This paper introduces LLMZero, a system in which LLM agents use tree search to optimize reinforcement learning (RL) post-training strategies. Its central finding is that capacity parameters accumulate monotonically across training stages, while regularization parameters oscillate in response to training dynamics. By diagnosing pathologies at each checkpoint and proposing coordinated multi-parameter transitions, LLMZero improves over the base model by 9% to 140% relative across 4 GRPO tasks and over grid search by 6% to 15% relative, and it consistently outperforms random search. The findings indicate that fixed schedules cannot express non-stationary exploration-exploitation tradeoffs, and they offer actionable design rules for multi-stage training.
LLMZero discovers adaptive training strategies that improve over the base model by up to 140% relative, and finds that regularization parameters oscillate with training dynamics while capacity parameters grow monotonically.
RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to shifting training dynamics. This distinction matters because fixed schedules commit all parameters to fixed trajectories and therefore cannot express the non-stationary exploration-exploitation tradeoffs that regularization must track; the principle provides actionable design rules for multi-stage training. We discover this through LLMZero, a system where LLM agents search over training trajectories via tree search, diagnosing pathologies at each checkpoint and proposing coordinated multi-parameter transitions. Across 4 diverse GRPO tasks, LLMZero discovers strategies that improve over the base model by 9% to 140% relative and over grid search by 6% to 15% relative, consistently outperforming random search and the skill-based agent. The structural principle transfers across tasks, providing an explanation for why discovered strategies take qualitatively different forms yet share similar parameter dynamics.
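The distinction between monotone and oscillating parameter trajectories can be made concrete with a small sketch. This is an illustration under stated assumptions, not the paper's analysis code: the parameter names `lora_rank` and `kl_coef` and all per-stage values are hypothetical, chosen only to show one way of labeling a trajectory by counting its direction changes.

```python
from typing import Sequence


def direction_changes(values: Sequence[float]) -> int:
    """Count sign flips between consecutive nonzero steps of a trajectory."""
    # Ignore stages where the value did not change at all.
    steps = [b - a for a, b in zip(values, values[1:]) if b != a]
    return sum(1 for s, t in zip(steps, steps[1:]) if (s > 0) != (t > 0))


def classify(values: Sequence[float]) -> str:
    """Label a per-stage parameter trajectory as 'monotone' or 'oscillating'."""
    return "monotone" if direction_changes(values) == 0 else "oscillating"


# Hypothetical per-stage values, invented for illustration only.
trajectories = {
    "lora_rank": [8, 16, 16, 32],          # capacity-like: never decreases
    "kl_coef": [0.04, 0.01, 0.05, 0.02],   # regularization-like: moves both ways
}

labels = {name: classify(vals) for name, vals in trajectories.items()}
print(labels)
```

A fixed schedule would typically commit `kl_coef` to a single direction in advance, which is the limitation the abstract points to: a trajectory with direction changes cannot be written as a monotone schedule.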