This paper introduces Sample-Routed Policy Optimization (SRPO), a novel on-policy RL framework that unifies Group Relative Policy Optimization (GRPO) and Self-Distillation Policy Optimization (SDPO) by routing samples based on correctness. SRPO addresses the limitations of GRPO's coarse credit assignment and SDPO's late-stage instability by dynamically routing correct samples to GRPO and incorrect samples to SDPO, and further incorporates an entropy-aware weighting mechanism for distillation targets. Experiments across five benchmarks demonstrate that SRPO outperforms both GRPO and SDPO, exhibiting rapid early improvement as well as long-horizon stability.
Achieve the best of both worlds in LLM policy optimization: SRPO combines the rapid gains of self-distillation with the long-term stability of group-relative methods, outperforming both by adaptively routing samples.
Reinforcement learning with verifiable rewards (RLVR) has become a standard paradigm for post-training large language models. While Group Relative Policy Optimization (GRPO) is widely adopted, its coarse credit assignment uniformly penalizes failed rollouts, lacking the token-level focus needed to efficiently address specific deviations. Self-Distillation Policy Optimization (SDPO) addresses this by providing denser, more targeted logit-level supervision that facilitates rapid early improvement, yet it frequently collapses during prolonged training. We trace this late-stage instability to two intrinsic flaws: self-distillation on already-correct samples introduces optimization ambiguity, and the self-teacher's signal reliability progressively degrades. To resolve these issues, we propose Sample-Routed Policy Optimization (SRPO), a unified on-policy framework that routes correct samples to GRPO's reward-aligned reinforcement and failed samples to SDPO's targeted logit-level correction. SRPO further incorporates an entropy-aware dynamic weighting mechanism to suppress high-entropy, unreliable distillation targets while emphasizing confident ones. Evaluated across five benchmarks and two model scales, SRPO achieves both the rapid early improvement of SDPO and the long-horizon stability of GRPO. It consistently surpasses the peak performance of both baselines, raising the five-benchmark average on Qwen3-8B by 3.4% over GRPO and 6.3% over SDPO, while simultaneously yielding moderate response lengths and lowering per-step compute cost by up to 17.2%.
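To make the routing rule concrete, below is a minimal PyTorch sketch of how the per-sample dispatch and entropy-aware weighting described in the abstract could look. The function name `srpo_loss`, the tensor shapes, the unclipped GRPO surrogate, and the exponential entropy weighting are illustrative assumptions based only on the abstract, not the paper's exact formulation.

```python
# Hedged sketch of SRPO's sample routing: correct rollouts get a GRPO-style
# group-relative policy-gradient term, failed rollouts get SDPO-style
# logit-level self-distillation with entropy-aware token weights.
# All names, shapes, and loss forms here are assumptions, not the paper's code.
import torch


def srpo_loss(policy_logits, teacher_logits, old_logprobs, token_ids,
              rewards, token_mask, entropy_temperature=1.0):
    """
    policy_logits : (G, T, V) current-policy logits for G rollouts of one prompt
    teacher_logits: (G, T, V) self-teacher logits (assumed frozen for this step)
    old_logprobs  : (G, T)    per-token log-probs under the behavior policy
    token_ids     : (G, T)    sampled response token ids
    rewards       : (G,)      binary verifiable rewards (1 = correct, 0 = failed)
    token_mask    : (G, T)    1 for response tokens, 0 for padding
    """
    correct = rewards > 0.5
    failed = ~correct

    student_logp = torch.log_softmax(policy_logits, dim=-1)             # (G, T, V)
    logprobs = student_logp.gather(-1, token_ids.unsqueeze(-1)).squeeze(-1)

    # --- Correct samples -> GRPO-style reward-aligned reinforcement ----------
    # Group-relative advantage: normalize rewards within the rollout group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)
    ratio = torch.exp(logprobs - old_logprobs)                           # ~1 on-policy
    pg_tokens = -(ratio * adv.unsqueeze(-1)) * token_mask
    grpo_term = pg_tokens[correct].sum() / token_mask[correct].sum().clamp(min=1)

    # --- Failed samples -> SDPO-style logit-level self-distillation ----------
    teacher_logp = torch.log_softmax(teacher_logits, dim=-1)
    kl = (teacher_logp.exp() * (teacher_logp - student_logp)).sum(-1)    # (G, T)

    # Entropy-aware weighting (assumed form): suppress high-entropy, unreliable
    # teacher tokens and emphasize confident ones.
    teacher_entropy = -(teacher_logp.exp() * teacher_logp).sum(-1)       # (G, T)
    weights = torch.exp(-teacher_entropy / entropy_temperature)
    sdpo_tokens = weights * kl * token_mask
    sdpo_term = sdpo_tokens[failed].sum() / token_mask[failed].sum().clamp(min=1)

    return grpo_term + sdpo_term
```

Under this reading, each rollout contributes to exactly one of the two objectives per step, which would be one plausible source of the reported per-step compute savings; the paper's actual weighting function and GRPO surrogate (e.g. clipping) may differ.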