The paper introduces sorted Group Policy Optimization (sGPO), a method that improves training efficiency in Reinforcement Learning with Verifiable Rewards (RLVR) by setting the rollout group size for each query according to that query's difficulty. sGPO spends a small budget of inference FLOPs to measure each query's empirical success rate under the initial policy, then removes trivial queries and sub-samples unsolvable ones, so that more of the training rollouts carry a learning signal. The approach matches or exceeds the performance of existing methods while reducing total training compute by a factor of three, including the cost of inference profiling.
Trading a fraction of inference compute for a threefold reduction in training costs, sGPO redefines efficiency in RLVR training.
Standard Reinforcement Learning with Verifiable Rewards (RLVR) training allocates a fixed rollout budget to every query, regardless of how difficult that query is for the current policy. This leads to two symmetric failure modes: easy queries produce near-zero advantage because the policy already solves them, while unsolvable queries produce no signal because the policy never solves them. Both regimes spend training FLOPs without contributing to the gradient. We introduce sorted Group Policy Optimization (sGPO), a compute-efficient strategy that trades a small budget of inference FLOPs for a large reduction in wasted training FLOPs. The key insight is that a single offline pass of cheap inference compute can serve as a proxy for query difficulty. By generating a small batch of parallel samples per query under the initial policy, we obtain a model-aware empirical success rate. This motivates setting the training rollout group size to the inverse of this success rate, a practical rule that maximizes sample efficiency by extracting the most advantage per generated rollout. The same profiling pass drives three decisions at once: data filtering (removing trivial queries and sub-sampling unsolvable ones), adaptive group size allocation, and curriculum construction (scheduling queries from easy to hard). sGPO matches or exceeds baseline performance while reducing total training compute by a factor of three, with the upfront inference profiling cost included.
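To illustrate how one profiling pass could drive filtering, group sizing, and curriculum ordering together, here is a minimal Python sketch. It is not the authors' implementation. The names plan_rollouts and QueryPlan, the max_group_size cap, rounding the inverse success rate up to an integer, giving retained unsolved queries the capped group size, and the keep-every-k rule for sub-sampling unsolved queries are all assumptions made for this example. The abstract specifies only the inverse-success-rate rule, the removal of trivial queries, the sub-sampling of unsolvable ones, and the easy-to-hard schedule.

```python
from dataclasses import dataclass


@dataclass
class QueryPlan:
    query_id: str
    success_rate: float  # empirical success rate from the profiling pass
    group_size: int      # number of training rollouts to generate for this query


def plan_rollouts(profile, max_group_size=64, keep_unsolved_every=4):
    """Turn profiling outcomes into a filtered, sized, and ordered training plan.

    `profile` maps a query id to a list of 0/1 outcomes sampled under the
    initial policy. All parameter names and defaults here are hypothetical.
    """
    plans = []
    unsolved_seen = 0
    for query_id, outcomes in profile.items():
        n = len(outcomes)
        if n == 0:
            raise ValueError(f"no profiling samples for query {query_id!r}")
        successes = sum(outcomes)

        if successes == n:
            # Trivial query: every sample succeeded, so the advantage is ~0.
            continue

        if successes == 0:
            # Unsolved query: keep only every k-th one (assumed sub-sampling rule)
            # and give it the largest allowed group (also an assumption).
            unsolved_seen += 1
            if unsolved_seen % keep_unsolved_every != 0:
                continue
            group_size = max_group_size
        else:
            # Group size = ceil(1 / success_rate) = ceil(n / successes),
            # computed in integers to avoid floating-point rounding.
            group_size = min(max_group_size, -(-n // successes))

        plans.append(QueryPlan(query_id, successes / n, group_size))

    # Curriculum: easy (high success rate) first, hard last.
    # Python's sort is stable, so ties keep their input order.
    plans.sort(key=lambda plan: plan.success_rate, reverse=True)
    return plans
```

With 8 profiling samples per query, a query solved once gets a group of 8 rollouts, a query solved four times gets a group of 2, and a query solved all 8 times is dropped. Because a retained query with at least one success always has a success rate below 1, its group size is never smaller than 2, so each group can contain both a success and a failure.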