This paper introduces Straggler-Aware Group Control (SAGC), a dynamic group-size controller that mitigates the impact of stragglers in synchronous reinforcement learning methods such as Group Relative Policy Optimization (GRPO). SAGC formulates group-size selection as an online constrained optimization problem and adapts the training group size to observed rollout behavior, which reduces straggler incidence and improves wall-clock efficiency. On downstream reasoning tasks, the resulting models are competitive with or better than those trained with the strongest static group-size baseline.
By adjusting group size dynamically, Straggler-Aware Group Control makes synchronous reinforcement learning more wall-clock efficient while keeping model quality competitive with or better than static group sizes.
Synchronous reinforcement learning methods such as Group Relative Policy Optimization (GRPO) provide stable and reproducible on-policy training, but they are highly vulnerable to stragglers: a single unusually long rollout can delay reward computation and parameter updates for the entire group. This problem becomes more severe as group size increases, creating a tension between the benefits of larger groups and the wall-clock cost of synchronization stalls. We propose Straggler-Aware Group Control (SAGC), a dynamic group-size controller that adapts the training group size online based on observed rollout behavior. SAGC formulates group-size selection as an online constrained optimization problem, seeking to retain the benefits of larger groups while controlling the long-term rate of straggler events. Across synchronous GRPO and DAPO training, and on top of both vanilla and strong engineered baselines, SAGC consistently reduces straggler incidence and improves wall-clock efficiency while achieving competitive or better training reward. We further show that these gains transfer to final model quality: SAGC is competitive with or better than the strongest static group-size baseline on downstream reasoning benchmarks, and often produces shorter outputs without any explicit length penalty. These results position dynamic group control as a practical way to make synchronous on-policy RL more efficient and robust.
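To make the idea of constrained online group-size selection concrete, the following is a minimal illustrative sketch, not the SAGC algorithm itself, whose details are not given in this abstract. It assumes a generic primal-dual (Lagrangian) scheme: a hypothetical utility `log(g)` stands in for the benefit of a larger group, a dual variable `lam` penalizes sizes with a high estimated straggler rate, and `lam` rises whenever observed straggler events exceed a target long-term rate. The class name, the utility function, the step sizes, and the boolean definition of a "straggler event" are all assumptions made for illustration.

```python
import math


def step_time(rollout_times):
    """A synchronous update waits for the slowest rollout in the group."""
    return max(rollout_times)


class GroupSizeController:
    """Hypothetical primal-dual group-size controller (illustrative only)."""

    def __init__(self, sizes, target_rate, dual_lr=0.5, ema=0.2):
        self.sizes = list(sizes)        # candidate group sizes
        self.target_rate = target_rate  # allowed long-term straggler rate
        self.dual_lr = dual_lr          # step size for the dual variable
        self.ema = ema                  # smoothing for per-size rate estimates
        self.lam = 0.0                  # dual variable (constraint price)
        self.est_rate = {g: 0.0 for g in self.sizes}

    def choose(self):
        # Maximize the Lagrangian: assumed utility minus priced straggler rate.
        return max(
            self.sizes,
            key=lambda g: math.log(g) - self.lam * self.est_rate[g],
        )

    def update(self, g, straggled):
        s = 1.0 if straggled else 0.0
        # Track the straggler rate observed at the size that was used.
        self.est_rate[g] += self.ema * (s - self.est_rate[g])
        # Dual ascent: raise the price when the constraint is violated,
        # lower it (never below zero) when there is slack.
        self.lam = max(0.0, self.lam + self.dual_lr * (s - self.target_rate))
```

In a training loop, one would call `choose()` before sampling each group, measure the rollout times, decide whether the step counted as a straggler event (for example, `step_time(times)` exceeding some threshold, which is again an assumption), and pass the result to `update()`. Under this sketch the controller starts at the largest size, backs off when stragglers accumulate, and returns to larger groups once the price `lam` decays.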