The paper introduces a continuous benchmark-generation process for evaluating enterprise-scale LLM agents in dynamic environments where ground-truth data is scarce. The authors leverage semi-structured documents expressing high-level intent and use LLMs to generate benchmarks, addressing the limitations of fixed benchmark sets for evolving agent requirements. The approach is instantiated in a service-migration case study, demonstrating a maintainable evaluation framework that provides rapid feedback and facilitates targeted agent improvements.
Forget hand-crafted benchmarks: this paper shows how LLMs can continuously generate relevant evaluation datasets for enterprise AI agents from just a few semi-structured documents.
The rapid adoption of AI agents across domains has made systematic evaluation crucial for ensuring their usefulness and successful production deployment. Evaluating an AI agent typically involves running it against a fixed set of benchmarks and computing multiple evaluation metrics. While sufficient for simple coding tasks, such benchmarks fall short for enterprise-scale agents, where services and requirements evolve continuously and ground-truth examples are sparse. We propose a benchmark-generation process that evolves the benchmarks as requirements change and enables robust evaluation of evolving AI agents. We instantiate this approach in a case study of service migration from one deployment platform to another at a large public enterprise. Our approach relies on semi-structured documents in which developers express high-level intent, and uses state-of-the-art LLMs to generate benchmarks from just a small number of such documents. Overall, this process results in a maintainable evaluation framework, enabling rapid feedback on agent performance and facilitating targeted improvements.
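The pipeline the abstract describes (intent document in, benchmark cases out) can be sketched as below. This is a minimal illustration, not the paper's implementation: the prompt wording, the JSON case schema (`input`/`expected` fields), and the `stub_llm` stand-in for a real model call are all assumptions made here for the example.

```python
import json


def build_benchmark_prompt(doc: str) -> str:
    """Wrap a semi-structured intent document in a benchmark-generation prompt.

    The prompt text is illustrative; the paper's actual prompting is not shown here.
    """
    return (
        "You are generating evaluation benchmarks for a service-migration agent.\n"
        "From the migration intent below, emit a JSON list of test cases, each\n"
        "with an 'input' (a migration request) and an 'expected' (target state).\n\n"
        f"Intent document:\n{doc}\n"
    )


def generate_benchmarks(doc: str, llm) -> list[dict]:
    """Call an LLM (any callable: prompt -> JSON string) and parse its output."""
    raw = llm(build_benchmark_prompt(doc))
    cases = json.loads(raw)
    # Keep only well-formed cases so a noisy LLM cannot corrupt the suite.
    return [c for c in cases if {"input", "expected"} <= c.keys()]


def stub_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call, for a self-contained demo."""
    return json.dumps([
        {"input": "migrate service A from platform X to platform Y",
         "expected": "service A running on platform Y"},
        {"note": "malformed case, should be filtered out"},
    ])


benchmarks = generate_benchmarks(
    "Migrate all tier-1 services to platform Y by Q3.", stub_llm
)
print(len(benchmarks))  # → 1
```

Because the generator is just a callable, the same filtering and parsing logic can be re-run whenever the intent documents change, which is the property that keeps such a benchmark suite maintainable as requirements evolve.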