May 29, 2026 · arXiv:2605.30789

Smaller Models are Natural Explorers for Policy-Level Diversity in GRPO

Yiming Ren, Yiran Xu, Zicheng Lin, Chufan Shi, Yukang Chen, Di Wang, Tianhe Wu, Junjie Wang, Yujiu Yang, Yu Qiao, Ruihang Chu

AI Summary

This paper proposes a way to increase rollout diversity in Group Relative Policy Optimization (GRPO): use smaller models from the same family as natural explorers for larger models. The authors show that smaller models exhibit higher policy-level diversity, evidenced by higher pass@k as the sample count grows, and that this diversity is temporally correlated and preserves logical consistency, unlike token-level noise. The proposed S2L-PO framework uses a progressive annealing strategy that transitions from offline small-model rollouts to the large learner's own sampling, which yields faster convergence and higher accuracy on mathematical reasoning benchmarks (e.g., +8.8% on AIME 24 when a 1.7B explorer guides an 8B model).
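The progressive annealing idea can be sketched as a schedule that decides how many rollouts in each GRPO group come from the fixed small explorer versus the large learner. This is a minimal illustration only: the linear shape, the `anneal_end` parameter, and both function names are assumptions made here, not details given in the summary or abstract.

```python
def small_model_fraction(step, total_steps, anneal_end=0.5):
    """Hypothetical linear schedule (an assumption, not the paper's).

    Returns the fraction of rollouts taken from the small explorer.
    It starts at 1.0 and decays linearly to 0.0 once `step` reaches
    `anneal_end * total_steps`.
    """
    cutoff = anneal_end * total_steps
    if cutoff <= 0:
        return 0.0
    return max(0.0, 1.0 - step / cutoff)


def split_group(group_size, step, total_steps, anneal_end=0.5):
    """Split one rollout group into (small-model, large-model) counts."""
    frac = small_model_fraction(step, total_steps, anneal_end)
    n_small = round(frac * group_size)
    return n_small, group_size - n_small
```

With a group of 8 rollouts and 100 training steps, `split_group(8, 25, 100)` assigns 4 rollouts to each source under this assumed schedule, and every rollout comes from the large learner from step 50 onward.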

Key Contribution

Fixed smaller models can improve the training of larger models by supplying structured, policy-level exploration signals, avoiding the step-wise noise that token-level randomness introduces.

Abstract

We identify a new dimension for enhancing rollout diversity in Group Relative Policy Optimization (GRPO) for LLMs. While GRPO relies on diverse rollouts, prevailing strategies primarily increase diversity by injecting more token-level randomness, which may introduce step-wise noise and lead to incoherent trajectories. We uncover that smaller models within the same model family inherently exhibit higher policy-level diversity, indicated by their superior pass@k relative to larger counterparts as sample counts increase. Unlike token-level noise, this diversity is temporally correlated, preserves logical consistency, and provides structured exploration signals for gradient estimation. We thus propose S2L-PO (Small-to-Large Policy Optimization), a framework that leverages fixed small models as natural explorers to train larger models. To balance exploration and exploitation, we design a progressive annealing strategy that transitions from offline small-model rollouts to the large learner's own sampling. This shift elegantly avoids mid-training performance drops caused by the small model's capacity limits, achieving faster convergence and unlocking a higher performance ceiling. S2L-PO improves accuracy on diverse mathematical reasoning benchmarks (e.g., +8.8% on AIME 24 using a 1.7B explorer to guide the 8B model) while reducing rollout compute.
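Two quantities in the abstract can be made concrete with a short sketch. The first function is the standard unbiased pass@k estimator from n samples with c correct; the second computes group-relative advantages in the usual GRPO style, normalizing each reward by the group's mean and standard deviation. The abstract does not state which pass@k estimator or normalization the authors use, so both choices, the function names, and the `eps` constant are assumptions for illustration.

```python
from math import comb
from statistics import mean, pstdev


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples drawn per problem, c: correct samples, k: budget (k <= n).
    """
    if n - c < k:
        # Fewer than k incorrect samples: every k-subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one rollout group (assumed GRPO form).

    Uses the population standard deviation; `eps` (an assumption) guards
    against division by zero when all rewards in the group are equal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Note that a group whose rewards are all identical receives zero advantage for every rollout and so contributes no gradient signal, which is one reason diverse rollouts matter for GRPO.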

RLHF & Preference Learning · Scaling Laws & Emergent Abilities

Citation Metrics

Citations: 0

Influential citations: 0

References: 50

Year: 2026

Venue: N/A
