The paper investigates redundancy in Large Reasoning Models (LRMs) that reason via long Chains of Thought (CoTs) and finds that LRMs implicitly know when to stop reasoning, a capability masked by standard sampling methods. The authors introduce SAGE (Self-Aware Guided Efficient Reasoning), a novel sampling paradigm that leverages this implicit knowledge. Integrating SAGE into reinforcement learning (SAGE-RL) incorporates SAGE-discovered efficient reasoning patterns into standard inference, improving both reasoning accuracy and efficiency on mathematical benchmarks.
LRMs already know when to stop reasoning, but current sampling methods are holding them back.
Recent advancements in large reasoning models (LRMs) have greatly improved their capabilities on complex reasoning tasks through Long Chains of Thought (CoTs). However, this approach often produces substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. Recent studies show that longer reasoning chains are frequently uncorrelated with correctness and can even be detrimental to accuracy. Through a deeper analysis of this phenomenon, we uncover and empirically verify that LRMs implicitly know the appropriate time to stop thinking, though this capability is obscured by current sampling paradigms. Motivated by this finding, we introduce SAGE (Self-Aware Guided Efficient Reasoning), a novel sampling paradigm that unleashes this efficient reasoning potential. Furthermore, integrating SAGE as mixed sampling into group-based reinforcement learning (SAGE-RL) effectively incorporates SAGE-discovered efficient reasoning patterns into standard pass@1 inference, markedly enhancing both the reasoning accuracy and efficiency of LRMs across multiple challenging mathematical benchmarks.