IBM Research, UCSB, Virtue AI | Jun 11, 2026 | arXiv:2606.13608

AgentBeats: Agentifying Agent Assessment for Openness, Standardization, and Reproducibility

Xiaoyuan Liu, Jianhong Tu, Yuqi Chen, Siyu Xie, Sihan Ren, Tianneng Shi, Gal Gantar, Evan Sandoval, Donghyun Lee, Daniela Miao, P. Gilbert, Nick Hynes, Mauro Staver, Warren He, David Marn, Andrew Low, Xi Zhang, Elron Bandel, Michal Shmueli-Scheuer, Siva Reddy, Alexandre Drouin, Alexandre Lacoste, Ramayya Krishnan, Elham Tabassi, Yue Su, Victor Barres, Chenguang Wang, Wenbo Guo, Dawn Song

AI Summary

This paper introduces Agentified Agent Assessment (AAA), a unified framework for evaluating agent systems. It addresses the fragmentation of current benchmarks by having judge agents perform the evaluation while all participants interact through standardized protocols: A2A for task management and MCP for tool access. The authors realize AAA in AgentBeats and evaluate it in a five-month open competition that drew 298 judge agents across 12 categories and 467 subject agents, showing that the approach applies across a heterogeneous range of benchmarks. A case study on coding agents indicates that agentified evaluation preserves fidelity with the public record while surfacing head-to-head results that were previously missing.

Key Contribution

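A unified, agent-agnostic assessment framework (AAA) in which judge agents evaluate subject agents over a single standardized interface, realized as AgentBeats and validated in an open competition and a coding-agent case study.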

Abstract

Agent systems are advancing quickly across domains, but their evaluation remains fragmented. Most benchmarks rely on fixed, LLM-centric harnesses that require heavy integration, create test-production mismatch, and limit fair comparison across diverse agent designs. The root problem is the lack of an open, agent-agnostic assessment interface. We advocate Agentified Agent Assessment (AAA), where evaluation is performed by judge agents and all participants interact through standardized protocols: A2A for task management and MCP for tool access. Conventional benchmarking defines two separate interfaces, one for the benchmark and one for the agent, while AAA only needs one; this yields a generic, unified framework that separates assessment logic from agent implementation and enables reproducible, interoperable, and multi-agent evaluation. We further introduce AgentBeats as a concrete realization of AAA: we identify five practical operation modes that make standardized assessment compatible with real-world constraints on openness, privacy, and reproducibility. To evaluate our design at scale, we conduct two studies: a five-month open competition that drew 298 judge agents across 12 categories together with 467 subject agents from independent participants, showing that AAA applies across a heterogeneous range of benchmarks; and a case study on coding agents that confirms agentified evaluation preserves fidelity with the public record while surfacing previously missing head-to-head results, yielding research insights about agent design. Combining a community-scale field study and a controlled coding case study, we verify that AAA delivers coverage, practicality, and fidelity across heterogeneous scenarios at scale. Together, AAA and AgentBeats offer a clear path toward open, standardized, and reproducible agent assessment.
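
The single-interface idea in the abstract can be illustrated with a small Python sketch. This is a minimal illustration under stated assumptions, not the AgentBeats implementation: the names `Task`, `Result`, `SubjectAgent`, `JudgeAgent`, and the two toy agents are all hypothetical, plain method calls stand in for A2A task messages, and MCP tool access is left out entirely. The point it shows is that the judge holds the assessment logic and only ever talks to subjects through one shared interface.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class Task:
    """Hypothetical stand-in for a task message sent from judge to subject."""
    task_id: str
    prompt: str


@dataclass
class Result:
    """Hypothetical stand-in for a subject's reply to a task."""
    task_id: str
    answer: str


class SubjectAgent(Protocol):
    """The one interface every participant exposes in this sketch."""

    def handle(self, task: Task) -> Result: ...


class UpperCaseAgent:
    """Toy subject: answers by upper-casing the prompt."""

    def handle(self, task: Task) -> Result:
        return Result(task.task_id, task.prompt.upper())


class EmptyAgent:
    """Toy subject with a different internal design: always answers ''."""

    def handle(self, task: Task) -> Result:
        return Result(task.task_id, "")


class JudgeAgent:
    """Holds the assessment logic; knows nothing about subject internals."""

    def __init__(self, cases: list[tuple[str, str]]):
        self.cases = cases  # (prompt, expected answer) pairs

    def assess(self, subject: SubjectAgent) -> float:
        """Send every case to the subject and return the pass rate."""
        passed = 0
        for i, (prompt, expected) in enumerate(self.cases):
            result = subject.handle(Task(task_id=f"t{i}", prompt=prompt))
            if result.answer == expected:
                passed += 1
        return passed / len(self.cases)


# One judge, two unrelated subjects, no per-agent integration code.
judge = JudgeAgent([("abc", "ABC"), ("hi", "HI")])
scores = {
    "upper": judge.assess(UpperCaseAgent()),
    "empty": judge.assess(EmptyAgent()),
}
print(scores)
```

Because the judge depends only on the shared `handle` interface, a new subject can be scored without touching the judge, and a new judge can score existing subjects without touching them; that separation is what the abstract describes as decoupling assessment logic from agent implementation.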

Eval Frameworks & Benchmarks | Open-Source Models & Weights | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 43

Year: 2026

Venue: N/A
