CUHK, Shanghai AI Lab, SJTU | Jun 2, 2026 | arXiv:2606.03503

ThoughtFold: Folding Reasoning Chains via Introspective Preference Learning

Ziyang Liu, Xueda Shen, Yuzhe Gu, Songyang Gao, Kuikun Liu, Guangran Cheng, Chengqi Lyu, Dahua Lin, Wenwei Zhang, Kai Chen

AI Summary

This paper introduces ThoughtFold, a framework that uses introspective preference learning to address the inefficiency caused by redundant explorations in the long reasoning chains of Large Reasoning Models (LRMs). By identifying and penalizing unnecessary trial-and-error segments within correct trajectories, ThoughtFold trains models to connect the essential reasoning segments directly, condensing their reasoning paths. Experiments show that the approach reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.

Key Contribution

ThoughtFold cuts the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy, by folding reasoning chains to eliminate redundant explorations.

Abstract

Large Reasoning Models (LRMs) have achieved remarkable progress thanks to Reinforcement Learning with Verifiable Rewards (RLVR) on Chains of Thought (CoTs). However, long CoTs naturally contain trial and error, and mainstream RLVR approaches select outcome-correct CoT trajectories for memorization, so the redundant explorations in long CoTs are inevitably reinforced, which leads to overthinking in LRMs. Previous attempts to resolve this issue mainly give more advantage to shorter trajectories, yet their learning signals are still outcome-based and cannot reduce the memorization of redundant explorations in long CoTs. We therefore propose ThoughtFold, a framework that leverages fine-grained preference learning to mitigate redundant explorations for efficient reasoning. ThoughtFold employs an introspective strategy to identify redundancy within each correct trajectory, which yields a spectrum of candidate sub-trajectories. Leveraging this spectrum, we introduce a masked preference optimization objective that explicitly penalizes redundant explorations and encourages the model to directly bridge essential reasoning segments, effectively folding its reasoning chains into a more concise path. Extensive experiments show that ThoughtFold significantly enhances efficiency: it reduces the token usage of DeepSeek-R1-Distill-Qwen-7B by approximately 56% while maintaining state-of-the-art accuracy.
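The abstract does not spell out the masked preference optimization objective, so the following is only a rough sketch of what a masked, DPO-style pairwise loss could look like, not the authors' formulation. It assumes that the preferred sample is a folded (concise) sub-trajectory, that the dispreferred sample is a trajectory containing a redundant exploration, and that a 0/1 mask selects which tokens contribute to the loss. The function names, the mask semantics, and the `beta` value are all hypothetical.

```python
import math


def masked_logprob(token_logprobs, mask):
    """Sum per-token log-probabilities over positions where mask is 1."""
    return sum(lp for lp, m in zip(token_logprobs, mask) if m)


def masked_preference_loss(pol_chosen, ref_chosen, mask_chosen,
                           pol_rejected, ref_rejected, mask_rejected,
                           beta=0.1):
    """DPO-style pairwise loss restricted to masked token positions.

    Assumption (not from the paper): 'chosen' is the folded trajectory,
    'rejected' contains the redundant exploration, and each mask marks
    the tokens that should carry the preference signal.
    """
    chosen_margin = (masked_logprob(pol_chosen, mask_chosen)
                     - masked_logprob(ref_chosen, mask_chosen))
    rejected_margin = (masked_logprob(pol_rejected, mask_rejected)
                       - masked_logprob(ref_rejected, mask_rejected))
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) computed as a numerically stable softplus(-x).
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


# When the policy equals the reference, both margins are zero,
# so the loss is log(2).
lp = [-1.0, -2.0, -0.5]
print(masked_preference_loss(lp, lp, [1, 1, 1], lp, lp, [1, 1, 1]))
```

Under these assumptions, tokens with mask 0 (for example, a prefix shared by both trajectories) have no effect on the loss, so the learning signal concentrates on the segments where the folded and redundant trajectories differ.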

Topics: Reasoning & Chain-of-Thought; RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 66

Year: 2026

Venue: N/A
