The paper introduces FuncBenchGen, a synthetic benchmark for evaluating multi-step tool use in language models by framing tool use as traversal over a function-dependency DAG. This framing allows controlled task difficulty and avoids data contamination, addressing limitations of existing benchmarks for tool-augmented language models (TaLMs). Experiments reveal sharp performance degradation as dependency depth increases and particular difficulty with connected distractor functions, while also demonstrating that explicitly restating prior variable values at each step substantially improves success rates.
LLMs struggle to track state across multiple tool-use steps, but a surprisingly simple fix, restating prior variable values, yields substantial performance gains.
Existing benchmarks for tool-augmented language models (TaLMs) lack fine-grained control over task difficulty and remain vulnerable to data contamination. We present FuncBenchGen, a unified, contamination-free framework that stress-tests TaLMs by generating synthetic multi-step tool-use tasks. The key idea is to cast tool use as traversal over a hidden function-dependency DAG, where models must infer the correct sequence of calls to compute a target value. FuncBenchGen allows precise control over task difficulty (e.g., graph size, dependency depth, and number of distractor functions) while avoiding pretraining/test-time leakage. Our evaluation demonstrates that reasoning-optimized models consistently outperform general-purpose models, with GPT-5 significantly outperforming all other models evaluated. Performance declines sharply as dependency depth increases. Furthermore, connected distractors (irrelevant functions that share type-compatible variables with relevant functions) prove especially difficult to handle. Moreover, strong models often make syntactically valid function calls but propagate incorrect or stale argument values across steps, revealing brittle state tracking by LLMs in multi-turn tool use. Motivated by this observation, we introduce a simple mitigation strategy that explicitly restates prior variable values to the agent at each step. Surprisingly, this lightweight change yields substantial gains across models, e.g., raising GPT-5's success rate from 62.5% to 81.3%.
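To make the DAG framing concrete, here is a minimal sketch of the idea, not the paper's actual generator: tasks are random function-dependency DAGs, and solving one amounts to resolving the target function's dependencies in a valid order while keeping prior values explicit. All names, the dependency-sampling scheme, and the stand-in function bodies are illustrative assumptions.

```python
import random
from typing import Dict, List

def make_task(num_funcs: int, seed: int = 0) -> Dict[str, List[str]]:
    """Generate a random function-dependency DAG.

    Each function f_i may depend only on earlier functions f_j (j < i),
    which guarantees the graph is acyclic by construction.
    """
    rng = random.Random(seed)
    deps: Dict[str, List[str]] = {}
    for i in range(num_funcs):
        candidates = [f"f{j}" for j in range(i)]
        # Sample up to two dependencies from earlier functions.
        k = min(len(candidates), rng.randint(0, 2))
        deps[f"f{i}"] = rng.sample(candidates, k)
    return deps

def evaluate(deps: Dict[str, List[str]], target: str) -> int:
    """Compute the target by recursively resolving its dependencies.

    The explicit `cache` of already-computed values mirrors the paper's
    mitigation: prior variable values are restated rather than re-derived,
    so no stale or incorrect value can be propagated.
    """
    cache: Dict[str, int] = {}
    def call(name: str) -> int:
        if name not in cache:
            args = [call(d) for d in deps[name]]
            cache[name] = 1 + sum(args)  # stand-in for a real function body
        return cache[name]
    return call(target)
```

Here the difficulty knobs the paper describes map directly onto generator parameters: `num_funcs` controls graph size, and the sampling of dependencies controls depth; distractor functions would be extra nodes not reachable from the target.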