The paper introduces FuncBenchGen, a synthetic benchmark for evaluating multi-step tool use in language models by framing tool use as traversal over a function-dependency DAG. This framing allows controlled task difficulty and avoids data contamination, addressing limitations of existing benchmarks for tool-augmented language models (TaLMs). Experiments reveal sharp performance degradation as dependency depth increases and particular difficulty with connected distractor functions, while also demonstrating that explicitly restating prior variable values at each step substantially improves success rates.
LLMs struggle to track state across multiple tool-use steps, but a surprisingly simple fix, restating prior variable values, yields substantial performance gains.
Existing benchmarks for tool-augmented language models (TaLMs) lack fine-grained control over task difficulty and remain vulnerable to data contamination. We present FuncBenchGen, a unified, contamination-free framework that stress-tests TaLMs by generating synthetic multi-step tool-use tasks. The key idea is to cast tool use as traversal over a hidden function-dependency DAG, where models must infer the correct sequence of calls to compute a target value. FuncBenchGen allows precise control over task difficulty (e.g., graph size, dependency depth, and number of distractor functions) while avoiding pretraining/test-time leakage. Our evaluation demonstrates that reasoning-optimized models consistently outperform general-purpose models, with GPT-5 significantly outperforming all other models evaluated. Performance declines sharply as dependency depth increases. Furthermore, connected distractors (irrelevant functions that share type-compatible variables with relevant functions) prove especially difficult to handle. Moreover, strong models often make syntactically valid function calls but propagate incorrect or stale argument values across steps, revealing brittle state tracking by LLMs in multi-turn tool use. Motivated by this observation, we introduce a simple mitigation strategy that explicitly restates prior variable values to the agent at each step. Surprisingly, this lightweight change yields substantial gains across models, e.g., raising GPT-5's success rate from 62.5% to 81.3%.
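To make the DAG framing concrete, here is a minimal sketch of the idea, not the paper's actual generator: tasks are random function-dependency DAGs, and solving one amounts to resolving the target function's dependencies in a valid order while keeping prior values explicit. All names, the dependency-sampling scheme, and the stand-in function bodies are illustrative assumptions.

```python
import random
from typing import Dict, List

def make_task(num_funcs: int, seed: int = 0) -> Dict[str, List[str]]:
    """Generate a random function-dependency DAG.

    Each function f_i may depend only on earlier functions f_j (j < i),
    which guarantees the graph is acyclic by construction.
    """
    rng = random.Random(seed)
    deps: Dict[str, List[str]] = {}
    for i in range(num_funcs):
        candidates = [f"f{j}" for j in range(i)]
        # Sample up to two dependencies from earlier functions.
        k = min(len(candidates), rng.randint(0, 2))
        deps[f"f{i}"] = rng.sample(candidates, k)
    return deps

def evaluate(deps: Dict[str, List[str]], target: str) -> int:
    """Compute the target by recursively resolving its dependencies.

    The explicit `cache` of already-computed values mirrors the paper's
    mitigation: prior variable values are restated rather than re-derived,
    so no stale or incorrect value can be propagated.
    """
    cache: Dict[str, int] = {}
    def call(name: str) -> int:
        if name not in cache:
            args = [call(d) for d in deps[name]]
            cache[name] = 1 + sum(args)  # stand-in for a real function body
        return cache[name]
    return call(target)
```

Here the difficulty knobs the paper describes map directly onto generator parameters: `num_funcs` controls graph size, and the sampling of dependencies controls depth; distractor functions would be extra nodes not reachable from the target.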