Microsoft Research, Duke, UCF, UMass, Wayfair, Yanshan, York, Zoom | Aug 21, 2025 | arXiv:2508.15760

LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries

Ming Yin, Dinghan Shen, Silei Xu, Jian-Jun Han, Sixun Dong, Mian Zhang, Yebowen Hu, Shujian Liu, Simin Ma, Song Wang, Sathish Indurthi, Xun Wang, Yiran Chen, Kaiqiang Song

AI Summary

The paper introduces LiveMCP-101, a benchmark of 101 real-world queries designed to stress test AI agents' ability to solve multi-step tasks using diverse MCP tools. It addresses the gap in evaluating AI agents' effectiveness in dynamic scenarios by requiring coordinated use of tools like web search, file operations, and data analysis. The benchmark reveals that even state-of-the-art LLMs struggle, achieving success rates below 60%, and highlights inefficiencies in token usage and tool orchestration.

Key Contribution

Even the best LLMs fail more than 40% of the time when orchestrating multiple tools in realistic scenarios, revealing critical gaps in real-world agent capabilities.

Abstract

Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how well AI agents can effectively solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.
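
The evaluation idea in the abstract (scoring against ground-truth execution plans rather than stored API outputs) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `run_agent`, `run_reference_plan`, and `is_match` are hypothetical callables standing in for the agent under test, a fresh execution of the ground-truth plan, and whatever comparison the benchmark uses. The toy stubs at the bottom are invented solely to show the arithmetic for 101 queries.

```python
from typing import Callable, Sequence


def success_rate(
    queries: Sequence[str],
    run_agent: Callable[[str], str],
    run_reference_plan: Callable[[str], str],
    is_match: Callable[[str, str], bool],
) -> float:
    """Fraction of queries where the agent's answer matches a fresh reference.

    All three callables are hypothetical placeholders (assumptions), not part
    of any real LiveMCP-101 API.
    """
    if not queries:
        raise ValueError("queries must be non-empty")
    solved = 0
    for query in queries:
        # The reference is produced at evaluation time by executing the
        # ground-truth plan, so it tracks a changing environment.
        reference = run_reference_plan(query)
        answer = run_agent(query)
        if is_match(answer, reference):
            solved += 1
    return solved / len(queries)


# Toy stubs (invented for illustration): the "agent" solves 60 of 101 queries.
toy_queries = [f"q{i}" for i in range(101)]
toy_rate = success_rate(
    toy_queries,
    run_agent=lambda q: q.upper() if int(q[1:]) < 60 else "wrong",
    run_reference_plan=lambda q: q.upper(),
    is_match=lambda answer, reference: answer == reference,
)
print(f"{toy_rate:.3f}")
```

With 101 queries, a success rate below 60% corresponds to at most 60 solved queries, since 60/101 is about 59.4% while 61/101 is about 60.4%.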

Eval Frameworks & Benchmarks, Tool Use & Agents

Citation Metrics

Citations: 11

Influential citations: 0

References: 51

Year: 2025

Venue: arXiv.org
