The paper investigates the impact of token budget constraints on the performance of large language models (LLMs) employing different reasoning modalities (code, natural language, hybrid, or none) across mathematical benchmarks. By systematically ablating token budgets for reasoning in GPT-5.1, Gemini 3 Flash, DeepSeek-V3.2, and Grok 4.1, the study demonstrates that truncated reasoning chains can significantly degrade performance, with code-based reasoning exhibiting more graceful degradation compared to natural language or hybrid approaches. The results highlight the importance of complete reasoning chains and the model-dependent robustness to token constraints for deploying reasoning-specialized systems.
Cutting LLMs' reasoning token budget can backfire spectacularly, tanking performance even below that of models with *no* reasoning at all.
Reasoning-specialized models like OpenAI's GPT-5.1 and DeepSeek-V3.2 allocate substantial inference compute to extended chain-of-thought (CoT) traces, yet reasoning tokens incur significant costs. How do different reasoning modalities (code, natural language, hybrid, or none) perform under token constraints? We introduce a framework that constrains models to reason exclusively through code, comments, both, or neither, then systematically ablates token budgets to 10\%, 30\%, 50\%, and 70\% of optimal. We evaluate four frontier models (GPT-5.1, Gemini 3 Flash, DeepSeek-V3.2, Grok 4.1) across mathematical benchmarks (AIME, GSM8K, HMMT). Our findings reveal: (1) \textbf{truncated reasoning can hurt}, as DeepSeek-V3.2 achieves 53\% with no reasoning but only 17\% with truncated CoT at 50\% budget; (2) \textbf{code degrades gracefully}, as Gemini's comments collapse to 0\% while code maintains 43-47\%; (3) \textbf{hybrid reasoning underperforms} single modalities; (4) \textbf{robustness is model-dependent}, as Grok maintains 80-90\% at 30\% budget where OpenAI and DeepSeek collapse to 7-27\%. These results suggest incomplete reasoning chains actively mislead models, with implications for deploying reasoning-specialized systems under resource constraints.
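To make the experimental grid concrete, the sketch below enumerates the conditions the abstract describes: each reasoning modality crossed with each budget fraction, plus a no-reasoning baseline. It is only an illustration under stated assumptions, not the authors' code. The abstract does not say how the "optimal" budget is measured, so `optimal_tokens` is a hypothetical input, and the names `truncated_budget` and `ablation_grid`, the rounding rule, and the treatment of the baseline as a single zero-budget condition are likewise assumptions.

```python
from itertools import product

# Modalities and budget fractions as listed in the abstract.
REASONING_MODALITIES = ["code", "comments", "both"]
BUDGET_FRACTIONS = [0.10, 0.30, 0.50, 0.70]


def truncated_budget(optimal_tokens: int, fraction: float) -> int:
    """Return a reasoning-token cap equal to `fraction` of the optimal budget.

    Assumption: round to the nearest token and never go below one token.
    """
    if optimal_tokens <= 0:
        raise ValueError("optimal_tokens must be positive")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    return max(1, round(optimal_tokens * fraction))


def ablation_grid(optimal_tokens: int) -> list[dict]:
    """Enumerate every condition for one model on one problem."""
    # Assumption: the no-reasoning baseline is a single condition with no budget.
    grid = [{"modality": "neither", "fraction": None, "max_reasoning_tokens": 0}]
    for modality, fraction in product(REASONING_MODALITIES, BUDGET_FRACTIONS):
        grid.append(
            {
                "modality": modality,
                "fraction": fraction,
                "max_reasoning_tokens": truncated_budget(optimal_tokens, fraction),
            }
        )
    return grid


if __name__ == "__main__":
    # Hypothetical optimal budget of 2000 reasoning tokens.
    for condition in ablation_grid(2000):
        print(condition)
```

Under these assumptions, each model and problem yields 13 conditions: three reasoning modalities at four budget fractions, plus the baseline. Comparing accuracy within a modality across fractions, and against the baseline, is the kind of comparison behind findings (1) and (2).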