Apr 2, 2026 · arXiv:2604.02155

Brief Is Better: Non-Monotonic Chain-of-Thought Budget Effects in Function-Calling Language Agents

AI Summary

This paper investigates the impact of chain-of-thought (CoT) reasoning length on the performance of function-calling language agents using the Berkeley Function Calling Leaderboard v3 Multiple benchmark. The key finding is a non-monotonic relationship between CoT length and accuracy: a brief CoT (32 tokens) significantly improves performance, while longer CoTs (256 tokens) degrade it, primarily due to incorrect function selection and hallucination. To address this, the authors introduce Function-Routing CoT (FR-CoT), a structured brief-CoT method that enforces commitment to a valid function name, achieving comparable accuracy to free-form CoT with improved reliability.

Key Contribution

Chain-of-thought reasoning can actually *hurt* language agent performance in function-calling tasks, with brief reasoning outperforming both direct answers and lengthy deliberation.

Abstract

How much should a language agent think before taking action? Chain-of-thought (CoT) reasoning is widely assumed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings remains poorly understood. We present a systematic study of CoT budget effects on function-calling agents, sweeping six token budgets (0 to 512) across 200 tasks from the Berkeley Function Calling Leaderboard v3 Multiple benchmark. Our central finding is a striking non-monotonic pattern on Qwen2.5-1.5B-Instruct: brief reasoning (32 tokens) improves accuracy by 45% relative to direct answers, from 44.0% to 64.0%, while extended reasoning (256 tokens) degrades performance well below the no-CoT baseline, to 25.0% (McNemar p<0.001). A three-way error decomposition reveals the mechanism. At a budget of d = 0 tokens, 30.5% of tasks fail because the model selects the wrong function from the candidate set; brief CoT reduces this to 1.5%, effectively acting as a function-routing step, while long CoT reverses the gain, yielding 28.0% wrong selections and 18.0% hallucinated functions at d = 256. Oracle analysis shows that 88.6% of solvable tasks require at most 32 reasoning tokens, with an average of 27.6 tokens, and a finer-grained sweep indicates that the true optimum lies at 8 to 16 tokens. Motivated by this routing effect, we propose Function-Routing CoT (FR-CoT), a structured brief-CoT method that templates the reasoning phase as "Function: [name] / Key args: [...]", forcing commitment to a valid function name at the start of reasoning. FR-CoT achieves accuracy statistically equivalent to free-form d = 32 CoT while reducing function hallucination to 0.0%, providing a structural reliability guarantee without budget tuning.
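The routing template quoted in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the regular expression, the helper names `parse_fr_cot` and `classify_selection`, the outcome labels, and the example function names are all hypothetical, chosen only to show how a "Function: [name] / Key args: [...]" header might be checked against a candidate set to separate wrong selections from hallucinated functions.

```python
import re
from typing import Optional

# Hypothetical pattern for the template quoted in the abstract:
#   "Function: [name] / Key args: [...]"
# The exact format used in the paper may differ; this is an assumption.
FR_COT_PATTERN = re.compile(
    r"^\s*Function:\s*(?P<name>[A-Za-z_][\w.]*)\s*/\s*Key args:\s*(?P<args>.*)$",
    re.DOTALL,
)


def parse_fr_cot(reasoning: str) -> Optional[tuple[str, str]]:
    """Return (function_name, key_args_text), or None if the template is not followed."""
    match = FR_COT_PATTERN.match(reasoning)
    if match is None:
        return None
    return match.group("name"), match.group("args").strip()


def classify_selection(reasoning: str, candidates: set[str], expected: str) -> str:
    """Label a reasoning header with one of four illustrative outcomes.

    The labels are assumptions made for this sketch, not the paper's
    error taxonomy.
    """
    parsed = parse_fr_cot(reasoning)
    if parsed is None:
        return "unparsed"          # template not followed
    name, _ = parsed
    if name not in candidates:
        return "hallucinated"      # name is not in the candidate set
    if name != expected:
        return "wrong_function"    # valid candidate, but not the right one
    return "routed_correctly"


if __name__ == "__main__":
    # Hypothetical candidate functions for a single task.
    candidates = {"get_weather", "get_time"}
    header = "Function: get_weather / Key args: city=Paris"
    print(parse_fr_cot(header))
    print(classify_selection(header, candidates, expected="get_weather"))
```

Checking the committed name against the candidate set before any arguments are generated is one way a structured header could rule out hallucinated function names by construction; how FR-CoT enforces this in practice is not specified in the abstract.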

Eval Frameworks & Benchmarks, Reasoning & Chain-of-Thought, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 18

Year: 2026

Venue: N/A
