This paper investigates how chain-of-thought (CoT) reasoning length affects the performance of function-calling language agents on the Berkeley Function Calling Leaderboard v3 Multiple benchmark. The key finding is a non-monotonic relationship between CoT length and accuracy: a brief CoT (32 tokens) significantly improves performance, while a longer CoT (256 tokens) degrades it below the no-CoT baseline, primarily through incorrect function selection and hallucinated function names. To address this, the authors introduce Function-Routing CoT (FR-CoT), a structured brief-CoT method that enforces commitment to a valid function name, achieving accuracy comparable to free-form CoT with improved reliability.
Chain-of-thought reasoning can actually *hurt* language agent performance in function-calling tasks, with brief reasoning outperforming both direct answers and lengthy deliberation.
How much should a language agent think before taking action? Chain-of-thought (CoT) reasoning is widely assumed to improve agent performance, but the relationship between reasoning length and accuracy in structured tool-use settings remains poorly understood. We present a systematic study of CoT budget effects on function-calling agents, sweeping six token budgets d (0 to 512 tokens) across 200 tasks from the Berkeley Function Calling Leaderboard v3 Multiple benchmark. Our central finding is a striking non-monotonic pattern on Qwen2.5-1.5B-Instruct: brief reasoning (d = 32) raises accuracy from 44.0% to 64.0%, a 45% relative gain over direct answers, while extended reasoning (d = 256) degrades performance well below the no-CoT baseline, to 25.0% (McNemar p < 0.001). A three-way error decomposition reveals the mechanism. At d = 0, 30.5% of tasks fail because the model selects the wrong function from the candidate set; brief CoT reduces this to 1.5%, effectively acting as a function-routing step, while long CoT reverses the gain, yielding 28.0% wrong selections and 18.0% hallucinated functions at d = 256. Oracle analysis shows that 88.6% of solvable tasks require at most 32 reasoning tokens, with an average of 27.6 tokens, and a finer-grained sweep indicates that the true optimum lies at 8 to 16 tokens. Motivated by this routing effect, we propose Function-Routing CoT (FR-CoT), a structured brief-CoT method that templates the reasoning phase as "Function: [name] / Key args: [...]", forcing commitment to a valid function name at the start of reasoning. FR-CoT achieves accuracy statistically equivalent to free-form d = 32 CoT while reducing function hallucination to 0.0%, providing a structural reliability guarantee without budget tuning.
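To make the routing idea concrete, the following is a minimal sketch of how a reasoning line in the quoted "Function: [name] / Key args: [...]" template might be parsed and checked against the candidate set. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not say how commitment to a valid name is enforced (constrained decoding is one possibility), and the names `FR_COT_PATTERN`, `parse_fr_cot`, `classify_selection`, and the category labels are all hypothetical.

```python
import re
from typing import Optional, Set, Tuple

# Hypothetical regex for the template quoted in the abstract:
#   "Function: [name] / Key args: [...]"
# Dots are allowed in names on the assumption that some are qualified (e.g. "math.sqrt").
FR_COT_PATTERN = re.compile(
    r"^\s*Function:\s*([A-Za-z_][\w.]*)\s*/\s*Key args:\s*(.*)$"
)


def parse_fr_cot(reasoning: str) -> Optional[Tuple[str, str]]:
    """Return (function_name, key_args_text) from the first line, or None.

    Only the first line is inspected, mirroring the idea that the model
    commits to a function name at the start of its reasoning.
    """
    lines = reasoning.strip().splitlines()
    if not lines:
        return None
    match = FR_COT_PATTERN.match(lines[0])
    if match is None:
        return None
    return match.group(1), match.group(2).strip()


def classify_selection(predicted: str, gold: str, candidates: Set[str]) -> str:
    """Label a function choice; the label names are assumptions for illustration."""
    if predicted not in candidates:
        return "hallucinated"      # name is not among the offered functions
    if predicted != gold:
        return "wrong_function"    # a real candidate, but not the right one
    return "correct"


# Usage with made-up function names (not taken from the benchmark).
candidates = {"get_weather", "get_time"}
parsed = parse_fr_cot("Function: get_weather / Key args: city=Paris")
if parsed is not None:
    name, key_args = parsed
    print(name, "->", classify_selection(name, "get_weather", candidates))
```

A post-hoc check like this can only detect a hallucinated name after generation. Guaranteeing that none is emitted, as the reported 0.0% rate suggests, would presumably require restricting the name at decoding time, which this sketch does not attempt.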