CUHK, Eastern Institute of Technology, Oxford, Tencent AI, ZJU

Jun 7, 2026

arXiv:2606.08815

Momentum for Reasoning: Dense Intrinsic Signals in Policy Optimization

Zhanming Shen, Liyao Li, Yanyu Chen, Xuhang Zhu, Xiaomeng Hu, Qi Zhang, Ru Peng, Xiaoyu Shen, Haobo Wang, Junbo Zhao

AI Summary

This paper introduces Intrinsic Signal Policy Optimization (ISPO), an approach that enhances reinforcement learning with verifiable rewards by incorporating intrinsic signals derived from the policy's own conditional probabilities. By addressing Zero-Advantage Collapse and Hallucinated Certainty, ISPO provides a more robust framework for long-chain reasoning in large language models. Experiments across three base models and five mathematical reasoning benchmarks show that ISPO consistently outperforms competitive baselines, with the largest gains on the hardest benchmarks.

Key Contribution

ISPO mitigates two structural failure modes of RLVR, Zero-Advantage Collapse and Hallucinated Certainty, by densifying the binary outcome reward with intrinsic signals, which improves performance on mathematical reasoning benchmarks.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for eliciting long-chain reasoning in large language models. However, existing methods based on Group Relative Policy Optimization (GRPO) rely on a binary outcome reward, which induces two structural failure modes: Zero-Advantage Collapse, in which all rollouts in a group share the same outcome and the gradient vanishes, and Hallucinated Certainty, in which the model becomes increasingly confident on incorrect rollouts late in training. We address both modes by densifying the reward with intrinsic signals computed entirely from the policy's own conditional probabilities, and propose ISPO (Intrinsic Signal Policy Optimization), which combines a sequence-level signal measuring how informative the thinking trajectory is for the final answer with a token-level directional reward whose hallucinated-certainty hinge penalizes confidently wrong predictions at critical decision tokens. Across three base models and five mathematical reasoning benchmarks, ISPO consistently outperforms competitive baselines, with the largest gains on the hardest benchmarks where zero-advantage collapse is most frequent, and training-dynamics diagnostics confirm that both failure modes are reduced.
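Zero-Advantage Collapse can be illustrated with a minimal sketch of a group-normalized advantage, assuming the commonly used form (reward minus group mean, divided by group standard deviation). The function name `group_advantages`, the `eps` value, and the use of the population standard deviation are illustrative assumptions, not details taken from the paper. The `bonus` values in the last lines are made-up placeholders for some dense per-rollout signal and do not represent ISPO's actual intrinsic signals.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage for one prompt's rollouts.

    Assumed form: (r - group mean) / (group std + eps).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes: correct rollouts get positive advantage, wrong ones negative.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])

# Uniform outcomes: every advantage is zero, so the policy gradient vanishes.
all_wrong = group_advantages([0.0, 0.0, 0.0, 0.0])
all_right = group_advantages([1.0, 1.0, 1.0, 1.0])

# Hypothetical dense bonus (placeholder numbers, NOT ISPO's signal):
# any term that varies across rollouts restores a non-zero advantage.
bonus = [0.02, 0.09, 0.04, 0.06]
densified = group_advantages([0.0 + b for b in bonus])

print(mixed)
print(all_wrong, all_right)
print(densified)
```

With a binary reward, any group whose rollouts are all correct or all incorrect contributes no learning signal under this normalization, which is why the collapse is expected to be most frequent on the hardest problems.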

Reasoning & Chain-of-Thought, RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
