Jun 12, 2025 | arXiv:2506.11266

Live API-Bench: 2500+ Live APIs for Testing Multi-Step Tool Calling

Benjamin Elder, Anupama Murthi, Jungkoo Kang, Ankita Rajaram Naik, Kiran Kate, Kinjal Basu, Danish Contractor

AI Summary

The paper introduces Live API Bench, a benchmark for evaluating LLMs' tool-calling capabilities in realistic settings by transforming NL2SQL datasets into interactive API environments with over 2,500 APIs. The authors convert SQL queries from the BIRD SQL dataset into executable API sequences across three formulations: SLOT, SEL, and REST. Evaluating 10 LLMs and 4 ReAct agents on the benchmark yields low task completion rates (7-47%), which leaves significant room for improvement in LLM tool use, even with interactive agents.

Key Contribution

LLMs still struggle to effectively use tools in realistic API environments, achieving only 7-47% task completion rates on a new benchmark of 2500+ live APIs.

Abstract

Large language models (LLMs) increasingly rely on external tools and APIs to execute complex tasks specified in natural language. Evaluating such tool-calling capabilities in realistic enterprise settings is challenging: APIs are often proprietary, heterogeneous, and difficult to share, limiting reproducible benchmarks. To address this, we introduce Live API Bench, a comprehensive benchmark constructed by transforming NL2SQL datasets into interactive API environments. Our pipeline converts SQL queries from BIRD SQL into executable API sequences across three formulations (SLOT, SEL, and REST), covering minimal general-purpose operations, domain-specific multi-step tasks, and function-oriented RESTful interactions, respectively. The benchmark spans 11 databases with over 2,500 invocable tools, paired with human-authored queries, ground-truth API sequences, and verified final answers. Live API Bench enables systematic evaluation of core challenges in tool use, including error handling, sequential reasoning, parameter generation, response parsing, and robustness across diverse domains. We evaluate 10 LLMs and 4 ReAct agents, observing low task completion rates (7 to 47%), which improve modestly to 50% under interactive agent settings, highlighting substantial scope for improving LLM tool-calling performance. We release all code and data associated with this paper.
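As a rough illustration of what turning a SQL query into an executable API sequence could look like, the sketch below chains three general-purpose tools and checks task completion by comparing the final answer against a verified one. The tool names (`filter_rows`, `select_column`, `aggregate`), the call format, and the toy `SCHOOLS` table are all hypothetical and are not taken from the benchmark; the paper's actual tool definitions and scoring may differ.

```python
from typing import Any

Row = dict[str, Any]

# Hypothetical general-purpose tools; names and signatures are illustrative only.
def filter_rows(rows: list[Row], column: str, value: Any) -> list[Row]:
    """Keep rows where `column` equals `value` (like WHERE column = value)."""
    return [r for r in rows if r[column] == value]

def select_column(rows: list[Row], column: str) -> list[Any]:
    """Project a single column out of the rows (like SELECT column)."""
    return [r[column] for r in rows]

def aggregate(values: list[Any], op: str) -> Any:
    """Reduce a list of values with a named operation (like MAX, COUNT)."""
    ops = {"count": len, "sum": sum, "max": max, "min": min}
    return ops[op](values)

TOOLS = {
    "filter_rows": filter_rows,
    "select_column": select_column,
    "aggregate": aggregate,
}

def run_sequence(table: list[Row], calls: list[dict[str, Any]]) -> Any:
    """Execute tool calls in order, feeding each output into the next call."""
    result: Any = table
    for call in calls:
        result = TOOLS[call["name"]](result, **call["arguments"])
    return result

# Toy data standing in for one database table (assumed, not from the benchmark).
SCHOOLS = [
    {"name": "A", "county": "Alameda", "enrollment": 1200},
    {"name": "B", "county": "Alameda", "enrollment": 800},
    {"name": "C", "county": "Fresno", "enrollment": 950},
]

# SELECT MAX(enrollment) FROM schools WHERE county = 'Alameda'
GROUND_TRUTH_CALLS = [
    {"name": "filter_rows", "arguments": {"column": "county", "value": "Alameda"}},
    {"name": "select_column", "arguments": {"column": "enrollment"}},
    {"name": "aggregate", "arguments": {"op": "max"}},
]

def task_completed(predicted_calls: list[dict[str, Any]], expected: Any) -> bool:
    """A task counts as completed only if the sequence runs and matches the answer."""
    try:
        return run_sequence(SCHOOLS, predicted_calls) == expected
    except (KeyError, TypeError, ValueError):
        return False  # unknown tool, bad arguments, or an empty aggregate

if __name__ == "__main__":
    print(run_sequence(SCHOOLS, GROUND_TRUTH_CALLS))
```

In this sketch a model's predicted call list would be passed to `task_completed` in place of `GROUND_TRUTH_CALLS`; a single wrong tool name or argument anywhere in the chain fails the whole task, which is one reason multi-step completion rates can be low.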

Code Generation & Program Synthesis | Eval Frameworks & Benchmarks | Tool Use & Agents

Citation Metrics

Citations: 1

Influential citations: 0

References: 46

Year: 2025

Venue: N/A
