CMU ML | Feb 22, 2026 | arXiv:2602.18998

Benchmark Test-Time Scaling of General LLM Agents

Xiaochuan Li, Ryan Ming, Pranav Setlur, Abhijay Paladugu, Andy Tang, Hao Kang, Shuai Shao, Rong Jin, Chenyan Xiong

AI Summary

The paper introduces General AgentBench, a new benchmark designed to evaluate the performance of general-purpose LLM agents across diverse skills like search, coding, reasoning, and tool use in a unified environment. It reveals that current LLM agents exhibit significant performance degradation when evaluated in this more realistic, general setting compared to domain-specific benchmarks. The study also finds that neither sequential scaling (iterative interaction) nor parallel scaling (sampling multiple trajectories) effectively improves performance due to context limitations and verification challenges.

Key Contribution

General-purpose LLM agents stumble badly when faced with the messy reality of diverse, multi-domain tasks, and simply scaling interactions or parallel sampling doesn't fix it.

Abstract

LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests. While existing benchmarks focus on domain-aware environments for developing specialized agents, evaluating general-purpose agents requires more realistic settings that challenge them to operate across multiple skills and tools within a unified environment. We introduce General AgentBench, a benchmark that provides such a unified framework for evaluating general LLM agents across search, coding, reasoning, and tool-use domains. Using General AgentBench, we systematically study test-time scaling behaviors under sequential scaling (iterative interaction) and parallel scaling (sampling multiple trajectories). Evaluation of ten leading LLM agents reveals a substantial performance degradation when moving from domain-specific evaluations to this general-agent setting. Moreover, we find that neither scaling methodology yields effective performance improvements in practice, due to two fundamental limitations: context ceiling in sequential scaling and verification gap in parallel scaling. Code is publicly available at https://github.com/cxcscmu/General-AgentBench.
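To make the parallel-scaling limitation concrete, the following is a minimal sketch of how a verification gap could be measured. It assumes one plausible reading of the term: the difference between the oracle success rate (at least one of k sampled trajectories solves the task) and the success rate obtained when the agent must submit the single trajectory it scores highest. The function names, the toy outcomes, and the self-assigned scores are all hypothetical illustrations, not the benchmark's actual metrics, data, or API.

```python
from typing import Sequence


def pass_at_k(outcomes: Sequence[Sequence[bool]]) -> float:
    """Oracle upper bound: fraction of tasks where any sampled trajectory succeeds."""
    return sum(any(task) for task in outcomes) / len(outcomes)


def selected_accuracy(
    outcomes: Sequence[Sequence[bool]],
    scores: Sequence[Sequence[float]],
) -> float:
    """Fraction of tasks solved when only the highest-scored trajectory is submitted."""
    solved = 0
    for task_outcomes, task_scores in zip(outcomes, scores):
        # Index of the trajectory the (hypothetical) verifier prefers.
        best = max(range(len(task_scores)), key=task_scores.__getitem__)
        solved += task_outcomes[best]
    return solved / len(outcomes)


def verification_gap(
    outcomes: Sequence[Sequence[bool]],
    scores: Sequence[Sequence[float]],
) -> float:
    """Assumed definition: oracle pass@k minus verifier-selected accuracy."""
    return pass_at_k(outcomes) - selected_accuracy(outcomes, scores)


# Toy data (invented for illustration): 4 tasks, 3 parallel trajectories each.
outcomes = [
    [False, True, False],
    [False, False, False],
    [True, False, False],
    [False, False, True],
]
# Hypothetical self-verification scores; higher means "more likely correct".
scores = [
    [0.9, 0.4, 0.2],
    [0.5, 0.6, 0.1],
    [0.8, 0.3, 0.2],
    [0.7, 0.6, 0.5],
]

print(pass_at_k(outcomes))                 # 0.75
print(selected_accuracy(outcomes, scores)) # 0.25
print(verification_gap(outcomes, scores))  # 0.5
```

In this toy setup, sampling more trajectories raises the oracle bound, but the submitted answer improves only if the selection step can recognize the successful trajectory, which is the kind of limitation the abstract attributes to parallel scaling.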

Eval Frameworks & Benchmarks | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 89

Year: 2026

Venue: N/A
