Academy | TJU | Apr 24, 2026 | arXiv:2604.22436

AgentSearchBench: A Benchmark for AI Agent Search in the Wild

AI Summary

AgentSearchBench is introduced as a large-scale benchmark for evaluating AI agent search, comprising nearly 10,000 real-world agents and focusing on retrieval and reranking based on both executable tasks and high-level descriptions. The benchmark reveals a significant discrepancy between semantic similarity and actual agent performance, indicating the inadequacy of relying solely on textual descriptions for agent selection. Incorporating lightweight behavioral signals, such as execution-aware probing, significantly improves agent ranking quality.

Key Contribution

Semantic similarity is a poor proxy for agent performance: ranking agents based on execution-aware probing beats description-based retrieval by a wide margin.

Abstract

The rapid growth of AI agent ecosystems is transforming how complex tasks are delegated and executed, creating a new challenge of identifying suitable agents for a given task. Unlike traditional tools, agent capabilities are often compositional and execution-dependent, making them difficult to assess from textual descriptions alone. However, existing research and benchmarks typically assume well-specified functionalities, controlled candidate pools, or only executable task queries, leaving realistic agent search scenarios insufficiently studied. We introduce AgentSearchBench, a large-scale benchmark for agent search in the wild, built from nearly 10,000 real-world agents across multiple providers. The benchmark formalizes agent search as retrieval and reranking problems under both executable task queries and high-level task descriptions, and evaluates relevance using execution-grounded performance signals. Experiments reveal a consistent gap between semantic similarity and actual agent performance, exposing the limitations of description-based retrieval and reranking methods. We further show that lightweight behavioral signals, including execution-aware probing, can substantially improve ranking quality, highlighting the importance of incorporating execution signals into agent discovery. Our code is available at https://github.com/Bingo-W/AgentSearchBench.
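The gap between description-based ranking and execution-grounded relevance can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper or its repository: the agent names, the toy scores, the use of nDCG as the ranking metric, and the linear blend of similarity and probe scores.

```python
import math

def dcg(relevances):
    """Discounted cumulative gain of relevance scores listed in ranked order."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg_at_k(ranked_agents, relevance, k):
    """nDCG@k of a ranking, given a mapping agent -> relevance label."""
    gains = [relevance.get(agent, 0.0) for agent in ranked_agents[:k]]
    ideal = sorted(relevance.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0

def rank_by(scores):
    """Return agent names sorted by descending score."""
    return sorted(scores, key=scores.get, reverse=True)

def rerank_with_probes(similarity, probes, weight=0.5):
    """Blend description similarity with a probe score (hypothetical scheme)."""
    combined = {
        agent: (1 - weight) * similarity[agent] + weight * probes.get(agent, 0.0)
        for agent in similarity
    }
    return rank_by(combined)

# Toy, made-up values: agent_a has the best-matching description but fails
# when actually executed on the task.
semantic_similarity = {"agent_a": 0.91, "agent_b": 0.85, "agent_c": 0.40}
probe_score = {"agent_a": 0.1, "agent_b": 0.7, "agent_c": 0.9}
execution_relevance = {"agent_a": 0.0, "agent_b": 1.0, "agent_c": 1.0}

description_ranking = rank_by(semantic_similarity)
probed_ranking = rerank_with_probes(semantic_similarity, probe_score)

description_ndcg = ndcg_at_k(description_ranking, execution_relevance, k=3)
probed_ndcg = ndcg_at_k(probed_ranking, execution_relevance, k=3)
print(description_ranking, round(description_ndcg, 3))
print(probed_ranking, round(probed_ndcg, 3))
```

In this toy setup the description-only ranking places the failing agent first and scores lower than the probe-aware ranking, which is the kind of discrepancy the benchmark is designed to measure at scale.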

Eval Frameworks & Benchmarks | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 47

Year: 2026

Venue: N/A
