AI Innovation | DeCoDE Lab | IBM Research | Red Hat | Jun 7, 2026 | arXiv:2606.08854

sGPO: Trading Inference FLOPs for Training Efficiency in RLVR

Shivchander Sudalairaj, Kai Xu, Akash Srivastava, Giorgio Giannone

AI Summary

The paper introduces sorted Group Policy Optimization (sGPO), a method that improves training efficiency in Reinforcement Learning with Verifiable Rewards (RLVR) by setting each query's rollout group size according to its difficulty for the initial policy. A small budget of inference FLOPs is spent once, offline, to estimate each query's empirical success rate. sGPO then uses that estimate to remove trivial queries, sub-sample unsolvable ones, set the group size to the inverse of the success rate, and order queries from easy to hard. The approach matches or exceeds baseline performance while reducing total training compute by a factor of three, with the cost of inference profiling included.

Key Contribution

sGPO spends a small, one-time inference profiling budget to cut total RLVR training compute by a factor of three while matching or exceeding baseline performance.

Abstract

Standard Reinforcement Learning with Verifiable Rewards (RLVR) training allocates a fixed rollout budget to every query, without regard for what each query's difficulty means for the current policy. This leads to two symmetric failure modes: easy queries produce near-zero advantage because the policy already solves them, while unsolvable queries produce no signal because the policy never solves them. Both regimes waste training FLOPs without contributing to a learning gradient. We introduce sorted Group Policy Optimization (sGPO), a compute-efficient strategy that trades a small budget of inference FLOPs for a large reduction in wasted training FLOPs. The key insight is that cheap inference compute can serve as a single offline proxy for query difficulty. By generating a small batch of parallel samples per query under the initial policy, we obtain a model-aware empirical success rate. This motivates setting the training rollout group size to the inverse of this success rate, a practical rule that maximizes sample efficiency by extracting the most advantage per generated rollout. This single profiling pass simultaneously drives data filtering (removing trivial queries and sub-sampling unsolvable ones), adaptive group size allocation, and curriculum construction (scheduling queries from easy to hard). sGPO matches or exceeds baseline performance while reducing total training compute by a factor of three, with the upfront inference profiling cost included.
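The profiling-to-schedule step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `build_schedule`, the `ProfiledQuery` record, the `max_group_size` cap, the `unsolved_keep_frac` sub-sampling rate, and the choice to give retained unsolved queries the maximum group size are all assumptions made for illustration. Only the three ideas themselves (filtering, group size set to the inverse success rate, easy-to-hard ordering) come from the abstract.

```python
import random
from dataclasses import dataclass


@dataclass
class ProfiledQuery:
    query_id: str
    success_rate: float  # fraction of profiling samples judged correct
    group_size: int      # rollouts to generate for this query in training


def build_schedule(success_counts, n_profile, max_group_size=64,
                   unsolved_keep_frac=0.1, seed=0):
    """Turn one profiling pass into a filtered, sized, easy-to-hard schedule.

    success_counts maps query_id -> number of correct samples out of
    n_profile samples drawn from the initial policy.
    All parameter names and defaults are hypothetical.
    """
    rng = random.Random(seed)
    schedule = []
    for query_id, n_correct in success_counts.items():
        if n_correct == n_profile:
            continue  # trivial query: always solved, so drop it
        if n_correct == 0:
            # Unsolved during profiling: keep only a random fraction.
            if rng.random() >= unsolved_keep_frac:
                continue
            # Assumption: 1/p is undefined at p = 0, so use the cap.
            group_size = max_group_size
        else:
            # Group size = ceil(1 / p), computed in integer arithmetic.
            inverse_rate = (n_profile + n_correct - 1) // n_correct
            group_size = min(max_group_size, inverse_rate)
        schedule.append(
            ProfiledQuery(query_id, n_correct / n_profile, group_size)
        )
    # Curriculum: highest success rate (easiest) first.
    schedule.sort(key=lambda q: q.success_rate, reverse=True)
    return schedule
```

Under this rule, any query kept with a nonzero success rate receives at least two rollouts, because a success rate strictly below one has an inverse strictly above one. How the paper handles the cap and the zero-success case is not stated in the abstract, so those branches should be read as placeholders.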

RLHF & Preference Learning | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
