OpenAI · DeepSeek · Feb 16, 2026 · arXiv:2602.14444

Broken Chains: The Cost of Incomplete Reasoning in LLMs

Ian Su, Gaurav Purushothaman, Jey Narayan, Ruhika Goel, Kevin Zhu, Yash More, Maheep Chaudhary

AI Summary

The paper investigates the impact of token budget constraints on the performance of large language models (LLMs) employing different reasoning modalities (code, natural language, hybrid, or none) across mathematical benchmarks. By systematically ablating token budgets for reasoning in GPT-5.1, Gemini 3 Flash, DeepSeek-V3.2, and Grok 4.1, the study demonstrates that truncated reasoning chains can significantly degrade performance, with code-based reasoning exhibiting more graceful degradation compared to natural language or hybrid approaches. The results highlight the importance of complete reasoning chains and the model-dependent robustness to token constraints for deploying reasoning-specialized systems.

Key Contribution

Cutting LLMs' reasoning token budget can backfire spectacularly, tanking performance even below that of models with *no* reasoning at all.

Abstract

Reasoning-specialized models like OpenAI's GPT-5.1 and DeepSeek-V3.2 allocate substantial inference compute to extended chain-of-thought (CoT) traces, yet reasoning tokens incur significant costs. How do different reasoning modalities (code, natural language, hybrid, or none) perform under token constraints? We introduce a framework that constrains models to reason exclusively through code, comments, both, or neither, then systematically ablates token budgets to 10%, 30%, 50%, and 70% of optimal. We evaluate four frontier models (GPT-5.1, Gemini 3 Flash, DeepSeek-V3.2, Grok 4.1) across mathematical benchmarks (AIME, GSM8K, HMMT). Our findings reveal: (1) **truncated reasoning can hurt**: DeepSeek-V3.2 achieves 53% with no reasoning but only 17% with truncated CoT at 50% budget; (2) **code degrades gracefully**: Gemini's comments collapse to 0% while code maintains 43-47%; (3) **hybrid reasoning underperforms** single modalities; (4) **robustness is model-dependent**: Grok maintains 80-90% at 30% budget where OpenAI and DeepSeek collapse to 7-27%. These results suggest incomplete reasoning chains actively mislead models, with implications for deploying reasoning-specialized systems under resource constraints.
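The ablation design described in the abstract can be sketched as a grid of conditions. This is only an illustrative sketch, not the authors' implementation: the abstract does not say how the "optimal" budget is measured, so `optimal_tokens` is a hypothetical input, and the names `MODALITIES`, `truncated_budget`, and `ablation_grid` are invented here. Crossing every modality (including "neither") with every budget level is also an assumption.

```python
from itertools import product

# Reasoning modalities and budget levels named in the abstract.
MODALITIES = ("code", "comments", "both", "neither")
BUDGET_PERCENTS = (10, 30, 50, 70)  # percent of the optimal token budget


def truncated_budget(optimal_tokens: int, percent: int) -> int:
    """Return the reasoning-token cap for one ablation level.

    Uses integer arithmetic (rounding down) to avoid float surprises,
    and never returns less than one token.
    """
    if optimal_tokens <= 0:
        raise ValueError("optimal_tokens must be positive")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    return max(1, optimal_tokens * percent // 100)


def ablation_grid(optimal_tokens: int) -> list[tuple[str, int, int]]:
    """Enumerate (modality, percent, token_cap) for every condition.

    Assumption: all four modalities are crossed with all four budgets.
    """
    return [
        (modality, percent, truncated_budget(optimal_tokens, percent))
        for modality, percent in product(MODALITIES, BUDGET_PERCENTS)
    ]


if __name__ == "__main__":
    # Hypothetical optimal budget of 2000 reasoning tokens.
    for modality, percent, cap in ablation_grid(2000):
        print(f"{modality:>8} @ {percent:>2}% -> {cap} tokens")
```

With a hypothetical optimal budget of 2000 tokens, the 30% condition caps reasoning at 600 tokens, and the full grid contains 16 conditions per model and benchmark.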

Code Generation & Program Synthesis · Eval Frameworks & Benchmarks · Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
