Notre Dame · Tencent AI · Apr 20, 2026 · arXiv:2604.18493

Too Correct to Learn: Reinforcement Learning on Saturated Reasoning Data

Zhenwen Liang, Yujun Zhou, Sidi Lu, Xiangliang Zhang

AI Summary

The paper identifies a failure mode in RL fine-tuning of LLMs on reasoning tasks: when the base model is already strong, nearly every rollout is correct, so group-relative policy optimization (GRPO) has few informative failure cases and the policy can drift into mode collapse. To address this, the authors introduce Constrained Uniform Top-K Sampling (CUTS), a decoding strategy that encourages exploration by sampling uniformly from constrained high-confidence candidates, which flattens the local optimization landscape. Integrating CUTS into the Mixed-CUTS training framework, they report improved out-of-domain generalization with Qwen3 models, including up to a 15.1% improvement in Pass@1 accuracy on the AIME25 benchmark over standard GRPO.
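The vanishing advantage signal can be illustrated with a minimal sketch. It assumes the commonly used group-relative formulation, in which each rollout's reward is standardized against the mean and standard deviation of its group; the function name `group_relative_advantages` and the `eps` term are illustrative choices, not taken from the paper.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same prompt.

    Assumed formulation: (reward - group mean) / (group std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Saturated prompt: every rollout is correct, so every advantage is zero
# and the group contributes no learning signal.
saturated = group_relative_advantages([1.0] * 8)

# Mixed prompt: one failure is enough to produce non-zero advantages.
mixed = group_relative_advantages([1.0, 1.0, 1.0, 0.0])

print(saturated)
print(mixed)
```

With identical rewards the numerator is zero for every rollout, which is the situation the summary describes for saturated benchmarks.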

Key Contribution

RL fine-tuning can *hurt* reasoning performance when your base LLM is already too good, unless you force it to explore more diverse solutions.

Abstract

Reinforcement Learning (RL) enhances LLM reasoning, yet a paradox emerges as models scale: strong base models saturate standard benchmarks (e.g., MATH), yielding correct but homogeneous solutions. In such environments, the lack of failure cases causes the advantage signal in group-relative algorithms (e.g., GRPO) to vanish, driving policies into mode collapse. To address this, we propose Constrained Uniform Top-K Sampling (CUTS), a parameter-free decoding strategy enforcing structure-preserving exploration. Unlike standard sampling that follows model biases, CUTS flattens the local optimization landscape by sampling uniformly from constrained high-confidence candidates. We integrate this into Mixed-CUTS, a training framework synergizing exploitative and exploratory rollouts to amplify intra-group advantage variance. Experiments on Qwen3 models demonstrate that our approach prevents policy degeneration and significantly boosts out-of-domain generalization. Notably, Mixed-CUTS improves Pass@1 accuracy on the challenging AIME25 benchmark by up to 15.1% over standard GRPO, validating that maintaining diversity within the semantic manifold is critical for rigorous reasoning.
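A rough sketch of uniform sampling over a constrained set of high-confidence tokens is given below. The abstract does not specify how the candidate set is constrained, so the rule used here (keep at most `k` top tokens whose probability is at least `min_ratio` times the top probability) is purely an assumption for illustration, as are the names `uniform_top_k_candidates`, `sample_uniform_top_k`, `k`, and `min_ratio`. It should not be read as the authors' CUTS procedure, which the abstract describes as parameter-free.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def uniform_top_k_candidates(logits, k=3, min_ratio=0.1):
    """Return token indices eligible for uniform sampling.

    Hypothetical constraint: a token must be among the k most probable
    and have probability >= min_ratio * (probability of the top token).
    """
    probs = softmax(logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    top = probs[ranked[0]]
    return [i for i in ranked if probs[i] >= min_ratio * top]


def sample_uniform_top_k(logits, k=3, min_ratio=0.1, rng=random):
    """Pick one candidate uniformly, ignoring the model's relative preferences."""
    return rng.choice(uniform_top_k_candidates(logits, k, min_ratio))


# Three near-tied tokens are all kept and sampled with equal probability.
close_call = uniform_top_k_candidates([5.0, 4.5, 4.0, -3.0])
# A single dominant token leaves nothing else to explore.
confident = uniform_top_k_candidates([5.0, 0.0, -1.0])
token = sample_uniform_top_k([5.0, 4.5, 4.0, -3.0], rng=random.Random(0))
```

The design point this tries to convey is that uniform sampling removes the model's bias among plausible continuations, while the constraint keeps low-confidence tokens out of the rollout.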

Eval Frameworks & Benchmarks · Reasoning & Chain-of-Thought · RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
