This survey paper provides a comprehensive overview of evaluation methodologies for LLM-based agents, categorizing existing approaches using a two-dimensional taxonomy based on evaluation objectives (what to evaluate) and evaluation process (how to evaluate). It highlights the gap between current research and enterprise-specific challenges like role-based access control and long-horizon interactions. The paper concludes by outlining future research directions needed for holistic, realistic, and scalable LLM agent evaluation.
Current LLM agent benchmarks overlook critical enterprise requirements like role-based access control and long-horizon interaction reliability, demanding a shift towards more holistic and realistic evaluation strategies.
The rise of LLM-based agents has opened new frontiers in AI applications, yet evaluating these agents remains a complex and underdeveloped area. This survey provides an in-depth overview of the emerging field of LLM agent evaluation, introducing a two-dimensional taxonomy that organizes existing work along (1) evaluation objectives (what to evaluate), such as agent behavior, capabilities, reliability, and safety, and (2) evaluation process (how to evaluate), including interaction modes, datasets and benchmarks, metric computation methods, and tooling. Beyond the taxonomy, we highlight enterprise-specific challenges, such as role-based access to data, the need for reliability guarantees, dynamic and long-horizon interactions, and compliance, which are often overlooked in current research. We also identify future research directions, including holistic, more realistic, and scalable evaluation. This work aims to bring clarity to the fragmented landscape of agent evaluation and to provide a framework for systematic assessment, enabling researchers and practitioners to evaluate LLM agents for real-world deployment.
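To make the two axes concrete, here is a minimal sketch (not from the paper; all names, fields, and the example benchmark are illustrative assumptions) of how a single evaluation approach could be tagged along the survey's two dimensions:

```python
from dataclasses import dataclass
from enum import Enum

class Objective(Enum):
    # "What to evaluate" -- the survey's first dimension.
    BEHAVIOR = "agent behavior"
    CAPABILITIES = "capabilities"
    RELIABILITY = "reliability"
    SAFETY = "safety"

class ProcessAspect(Enum):
    # "How to evaluate" -- the survey's second dimension.
    INTERACTION_MODE = "interaction mode"
    DATASET_BENCHMARK = "datasets and benchmarks"
    METRIC_COMPUTATION = "metric computation"
    TOOLING = "tooling"

@dataclass
class EvaluationApproach:
    """One entry placed in the two-dimensional taxonomy."""
    name: str
    objectives: list[Objective]
    process: list[ProcessAspect]

# Hypothetical entry: a benchmark classified along both axes.
example = EvaluationApproach(
    name="hypothetical-web-agent-benchmark",
    objectives=[Objective.CAPABILITIES, Objective.RELIABILITY],
    process=[ProcessAspect.DATASET_BENCHMARK, ProcessAspect.METRIC_COMPUTATION],
)
print(f"{example.name}: objectives={[o.value for o in example.objectives]}")
```

The point of the sketch is only that the taxonomy is a cross-product: any existing benchmark or method can be located by answering both questions, what it evaluates and how it evaluates it.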