The paper introduces Open-Reasoner-Zero, an open-source implementation for scaling reinforcement learning on base models for reasoning tasks. It demonstrates that simple PPO with GAE, combined with rule-based rewards and no KL regularization, can achieve state-of-the-art performance on reasoning benchmarks such as AIME2024, MATH500, and GPQA Diamond, surpassing DeepSeek-R1-Zero while using only one tenth of the training steps. The paper also analyzes the learned critic's ability to identify and devalue repetitive response patterns, leading to more robust advantage estimation and improved training stability.
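The "rule-based rewards" mentioned above stand in contrast to learned reward models: the reward is computed by a simple program that checks the model's final answer against a reference. A minimal sketch of this idea (the exact answer-extraction rules and function name here are illustrative assumptions, not the paper's implementation):

```python
import re

def rule_based_reward(response: str, reference: str) -> float:
    """Hypothetical rule-based reward: 1.0 if the final boxed answer
    matches the reference string exactly, else 0.0."""
    # Extract the content of the last \boxed{...} in the response.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    if not matches:
        return 0.0  # no parsable final answer -> no reward
    return 1.0 if matches[-1].strip() == reference.strip() else 0.0
```

Because the reward is a deterministic check rather than a neural model, it cannot be "hacked" through reward-model exploits, which is part of why such a minimalist pipeline can remain stable without KL regularization.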
Forget complex RLHF pipelines: simple PPO with rule-based rewards can outperform state-of-the-art reasoning models while slashing training costs by 90%.
We introduce Open-Reasoner-Zero, the first open-source implementation of large-scale reasoning-oriented RL training on a base model, focusing on scalability, simplicity, and accessibility. Through extensive experiments, we demonstrate that a minimalist approach, vanilla PPO with GAE ($\lambda=1$, $\gamma=1$) and straightforward rule-based rewards, without any KL regularization, is sufficient to scale up both benchmark performance and response length, replicating the scaling phenomenon observed in DeepSeek-R1-Zero. Using the same base model, Qwen2.5-32B base, as DeepSeek-R1-Zero-Qwen-32B, our implementation achieves superior performance across AIME2024, MATH500, and GPQA Diamond, while demonstrating remarkable efficiency, requiring only 1/10 of the training steps of the DeepSeek-R1-Zero pipeline. Moreover, our analysis not only covers training dynamics and ablations of critical design choices, but also quantitatively shows how the learned critic in Reasoner-Zero training effectively identifies and devalues repetitive response patterns, yielding more robust advantage estimation and enhancing training stability. Embracing the principles of open source, we release our source code, training data, and various model weights, fostering reproducibility and encouraging further exploration of the properties of related models.
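The GAE setting above is worth unpacking: with $\lambda=1$ and $\gamma=1$, the generalized advantage estimate collapses to the undiscounted Monte Carlo return-to-go minus the critic's value, $A_t = \sum_{l \ge 0} r_{t+l} - V(s_t)$. A minimal sketch of the standard GAE recursion illustrating this special case (function and variable names are our own, not from the released code):

```python
def gae_advantages(rewards, values, gamma=1.0, lam=1.0):
    """Standard GAE recursion (Schulman et al., 2016).
    With gamma = lam = 1, as in Open-Reasoner-Zero, this reduces to
    A_t = (sum of remaining rewards) - V(s_t)."""
    advantages = []
    gae = 0.0
    next_value = 0.0  # value after the terminal step is zero
    for r, v in zip(reversed(rewards), reversed(values)):
        delta = r + gamma * next_value - v   # TD residual
        gae = delta + gamma * lam * gae      # exponentially weighted sum
        advantages.append(gae)
        next_value = v
    return advantages[::-1]

# Sparse rule-based reward: 1.0 only at the final token of a correct answer.
adv = gae_advantages(rewards=[0.0, 0.0, 1.0], values=[0.5, 0.2, 0.8])
# -> [0.5, 0.8, 0.2], i.e. return-to-go minus the value at each step
```

This unbiased, high-variance estimator is exactly where a well-trained critic matters: if the critic learns to devalue repetitive spans, as the analysis quantifies, those spans receive lower advantages and the policy update stays stable without a KL penalty.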