Open-Sora Plan TeamPKUMay 21, 2026arXiv:2605.22703

Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals

Shuo Yang, Jinda Lu, Chiyu Ma, Kexin Huang, Haoming Meng, Qihui Zhang, Yuyang Liu, Bolin Ding, Guoyin Wang, Li Yuan, Jingren Zhou

AI Summary

This paper identifies hard clipping in GRPO-style RLVR objectives as a bottleneck that discards informative signals near the clipping boundary, leading to training instability and suboptimal convergence. They propose Near-boundary Stochastic Rescue (NSR), a method that stochastically retains tokens slightly beyond the clipping threshold to recover these lost signals. Experiments across 7B-30B models show NSR improves training stability and performance compared to DAPO and GSPO, demonstrating the importance of near-boundary signals.

Key Contribution

Rigid reward clipping throws away valuable information just beyond the boundary, but a simple stochastic rescue of these signals can substantially boost RLVR performance.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a central paradigm for scaling LLM reasoning, yet its optimization often suffers from training instability and suboptimal convergence. Through a systematic dissection of clipping-based GRPO-style objectives, we identify the rigid clipping decision induced by hard clipping as a key practical bottleneck in the studied RLVR setups. Specifically, our analysis suggests that informative signals can lie in the near-boundary region just beyond the clipping threshold, and are therefore discarded by the standard hard-clipping rule. Notably, once this bottleneck is precisely identified, even simple stochastic perturbations at the boundary can recover meaningful performance gains. Building on this finding, we propose Near-boundary Stochastic Rescue (NSR), a minimal, plug-and-play modification that stochastically retains these slightly out-of-bound tokens to recover lost signals. While NSR, via stochastic sampling, can be interpreted as inducing an implicit gradient decay in expectation, our ablations reveal that its stochastic, boundary-local rescue mechanism is consistently more effective than deterministic gradient decay. Validated by extensive experiments across model sizes from 7B to 30B and both dense and MoE architectures, as a plug-and-play solution, NSR substantially improves training stability and delivers consistent gains over strong baselines such as DAPO and GSPO.

RLHF & Preference Learning Training Efficiency & Optimization

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Clipping Bottleneck: Stabilizing RLVR via Stochastic Recovery of Near-Boundary Signals

Related Papers