This paper introduces Intrinsic Signal Policy Optimization (ISPO), an approach that enhances reinforcement learning with verifiable rewards (RLVR) by incorporating intrinsic signals derived from the policy's own conditional probabilities. By addressing Zero-Advantage Collapse and Hallucinated Certainty, ISPO provides a more robust framework for long-chain reasoning in large language models. Experiments across three base models and five mathematical reasoning benchmarks show that ISPO consistently outperforms competitive baselines, with the largest gains on the hardest benchmarks.
ISPO reduces two structural failure modes of GRPO-based RLVR by densifying the binary outcome reward with intrinsic signals, with the largest gains on the hardest mathematical reasoning benchmarks.
Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for eliciting long-chain reasoning in large language models. However, existing methods based on Group Relative Policy Optimization (GRPO) rely on a binary outcome reward, which induces two structural failure modes: Zero-Advantage Collapse, in which all rollouts in a group share the same outcome and the gradient vanishes, and Hallucinated Certainty, in which the model becomes increasingly confident on incorrect rollouts late in training. We address both modes by densifying the reward with intrinsic signals computed entirely from the policy's own conditional probabilities, and propose ISPO (Intrinsic Signal Policy Optimization), which combines a sequence-level signal measuring how informative the thinking trajectory is for the final answer with a token-level directional reward whose hallucinated-certainty hinge penalizes confidently wrong predictions at critical decision tokens. Across three base models and five mathematical reasoning benchmarks, ISPO consistently outperforms competitive baselines, with the largest gains on the hardest benchmarks, where Zero-Advantage Collapse is most frequent, and training-dynamics diagnostics confirm that both failure modes are reduced.
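Zero-Advantage Collapse can be illustrated with a minimal sketch of the group-relative advantage commonly used in GRPO-style training, where each rollout's reward is standardized within its group. The function name `group_relative_advantages`, the `eps` constant, and the toy reward groups are assumptions made for illustration; this is not the paper's code and does not implement ISPO's intrinsic signals.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one rollout group (GRPO-style sketch).

    `eps` is an assumed stabilizer that avoids division by zero when
    every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Mixed outcomes: correct rollouts get positive advantage, incorrect negative.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# Uniform outcomes: every advantage is zero, so the policy-gradient term
# for this group contributes nothing.
all_wrong = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
all_right = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print("mixed:    ", mixed)
print("all wrong:", all_wrong)
print("all right:", all_right)
```

With a binary outcome reward, a group on a hard problem often contains only incorrect rollouts, so the whole group yields no learning signal. This is the gap that a denser reward built from the policy's own conditional probabilities is meant to fill.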