The paper introduces ACTOR-CURATOR, a curriculum learning framework for RL post-training of LLMs that learns a neural curator to dynamically select training problems from a large problem bank by optimizing for expected policy improvement. Problem selection is formulated as a non-stationary stochastic bandit problem, and a principled loss function is derived based on online stochastic mirror descent with regret guarantees. Experiments across reasoning benchmarks demonstrate that ACTOR-CURATOR outperforms uniform sampling and strong curriculum baselines, achieving significant gains in performance and training efficiency, including up to 80% speedup.
Stop hand-crafting curricula for RL post-training: ACTOR-CURATOR learns to dynamically select training problems, reporting relative gains of up to 30.5% over the strongest baseline and up to 80% speedup on challenging reasoning tasks.
Post-training large foundation models with reinforcement learning typically relies on massive and heterogeneous datasets, making effective curriculum learning both critical and challenging. In this work, we propose ACTOR-CURATOR, a scalable and fully automated curriculum learning framework for reinforcement learning post-training of large language models (LLMs). ACTOR-CURATOR learns a neural curator that dynamically selects training problems from large problem banks by directly optimizing for expected policy performance improvement. We formulate problem selection as a non-stationary stochastic bandit problem, derive a principled loss function based on online stochastic mirror descent, and establish regret guarantees under partial feedback. Empirically, ACTOR-CURATOR consistently outperforms uniform sampling and strong curriculum baselines across a wide range of challenging reasoning benchmarks, demonstrating improved training stability and efficiency. Notably, it achieves relative gains of 28.6% on AIME2024 and 30.5% on ARC-1D over the strongest baseline and up to 80% speedup. These results suggest that ACTOR-CURATOR is a powerful and practical approach for scalable LLM post-training.
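To make the bandit framing concrete, here is a minimal sketch of online stochastic mirror descent with a negative-entropy regularizer (an EXP3-style multiplicative update) over a tiny problem bank with partial feedback. It is not the paper's neural curator or its loss function. The names `osmd_step` and `run`, the learning rate, the uniform-mixing step, and the synthetic gain signal are all assumptions chosen for illustration.

```python
import math
import random

def osmd_step(probs, chosen, gain, lr=0.1, explore=0.01):
    """One entropic mirror-descent (EXP3-style) update from a single observed gain."""
    k = len(probs)
    # Partial feedback: only the chosen problem is observed, so its gain is
    # importance-weighted by the probability with which it was sampled.
    est = gain / probs[chosen]
    # Multiplicative update = mirror descent under a negative-entropy regularizer.
    weights = list(probs)
    weights[chosen] *= math.exp(lr * est)
    total = sum(weights)
    # Assumption: mix with uniform so every problem keeps a probability floor,
    # which lets the sampler recover when the useful problems change over time.
    return [(1 - explore) * w / total + explore / k for w in weights]

def run(num_problems=5, steps=2000, seed=0):
    rng = random.Random(seed)
    probs = [1.0 / num_problems] * num_problems
    for t in range(steps):
        chosen = rng.choices(range(num_problems), weights=probs)[0]
        # Hypothetical stand-in for measured policy improvement in [0, 1]:
        # problem 0 is the useful one early on, the last problem later
        # (a crude model of non-stationarity as the actor improves).
        best = 0 if t < steps // 2 else num_problems - 1
        gain = 1.0 if (chosen == best and rng.random() < 0.8) else 0.0
        probs = osmd_step(probs, chosen, gain)
    return probs

if __name__ == "__main__":
    print([round(p, 3) for p in run()])
```

In this toy setting the sampling distribution first concentrates on problem 0 and then shifts toward the last problem once the synthetic gain signal moves. In an actual training loop the gain would come from a measured improvement of the policy rather than from a simulated coin flip.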