The paper introduces General AgentBench, a benchmark that evaluates general-purpose LLM agents across search, coding, reasoning, and tool use within a single unified environment. It shows that current LLM agents perform substantially worse in this more realistic, general setting than on domain-specific benchmarks. The study also finds that neither sequential scaling (iterative interaction) nor parallel scaling (sampling multiple trajectories) effectively improves performance: sequential scaling runs into a context ceiling, and parallel scaling into a verification gap.
General-purpose LLM agents stumble badly when faced with the messy reality of diverse, multi-domain tasks, and simply scaling interactions or parallel sampling doesn't fix it.
LLM agents are increasingly expected to function as general-purpose systems capable of resolving open-ended user requests. While existing benchmarks focus on domain-aware environments for developing specialized agents, evaluating general-purpose agents requires more realistic settings that challenge them to operate across multiple skills and tools within a unified environment. We introduce General AgentBench, a benchmark that provides such a unified framework for evaluating general LLM agents across search, coding, reasoning, and tool-use domains. Using General AgentBench, we systematically study test-time scaling behaviors under sequential scaling (iterative interaction) and parallel scaling (sampling multiple trajectories). Evaluation of ten leading LLM agents reveals a substantial performance degradation when moving from domain-specific evaluations to this general-agent setting. Moreover, we find that neither scaling methodology yields effective performance improvements in practice, due to two fundamental limitations: a context ceiling in sequential scaling and a verification gap in parallel scaling. Code is publicly available at https://github.com/cxcscmu/General-AgentBench.
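To make the parallel-scaling limitation concrete, here is a minimal sketch of one plausible reading of a "verification gap": the difference between the fraction of tasks where at least one of the sampled trajectories is correct (an oracle upper bound) and the fraction where the trajectory actually picked by a verifier is correct. The function names, the score-based selection rule, and the toy data are all assumptions made for illustration. They are not taken from the paper or from the General-AgentBench code.

```python
from typing import Sequence, Tuple

# Hypothetical representation: for each task, a list of correctness flags
# (one per sampled trajectory) and a parallel list of verifier scores.
Task = Tuple[Sequence[bool], Sequence[float]]


def pass_at_k(correct: Sequence[bool]) -> bool:
    """Oracle view: the task counts as solved if any trajectory is correct."""
    return any(correct)


def best_of_k(correct: Sequence[bool], scores: Sequence[float]) -> bool:
    """Selected view: solved only if the highest-scored trajectory is correct."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return correct[best]


def verification_gap(tasks: Sequence[Task]) -> Tuple[float, float, float]:
    """Return (oracle accuracy, selected accuracy, gap between them)."""
    n = len(tasks)
    oracle = sum(pass_at_k(c) for c, _ in tasks) / n
    selected = sum(best_of_k(c, s) for c, s in tasks) / n
    return oracle, selected, oracle - selected


# Toy data (invented): four tasks with two or three trajectories each.
tasks = [
    ([False, True, False], [0.9, 0.4, 0.1]),  # a correct sample exists, verifier picks a wrong one
    ([True, False], [0.8, 0.2]),              # verifier picks the correct sample
    ([False, False], [0.5, 0.6]),             # no correct sample at all
    ([False, True], [0.3, 0.7]),              # verifier picks the correct sample
]

print(verification_gap(tasks))  # (0.75, 0.5, 0.25)
```

In this toy setup, sampling more trajectories raises the oracle number only. The selected accuracy improves just as far as the verifier can tell correct trajectories from incorrect ones.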