PKU | Mar 10, 2026 | arXiv:2603.09821

One-Eval: An Agentic System for Automated and Traceable LLM Evaluation

Chengyu Shen, Yanheng Hou, Minghui Pan, Runming He, Zhen Hao Wong, Meiyi Qiang, Zhou Liu, Hao Liang, Peichao Lai, Zeang Sheng, Wentao Zhang

AI Summary

One-Eval is introduced as an agentic evaluation system that automates the process of evaluating large language models by converting natural language evaluation requests into executable workflows. It integrates NL2Bench for intent structuring, BenchResolve for benchmark resolution and schema normalization, and Metrics & Reporting for task-aware metric selection. The system incorporates human-in-the-loop checkpoints and preserves sample evidence trails, enabling efficient, reproducible, and auditable evaluations.

Key Contribution

Stop wrestling with evaluation codebases: One-Eval automates LLM evaluation from natural language requests, handling benchmark selection, dataset normalization, and metric reporting with minimal user effort.

Abstract

Reliable evaluation is essential for developing and deploying large language models, yet in practice it often requires substantial manual effort: practitioners must identify appropriate benchmarks, reproduce heterogeneous evaluation codebases, configure dataset schema mappings, and interpret aggregated metrics. To address these challenges, we present One-Eval, an agentic evaluation system that converts natural-language evaluation requests into executable, traceable, and customizable evaluation workflows. One-Eval integrates (i) NL2Bench for intent structuring and personalized benchmark planning, (ii) BenchResolve for benchmark resolution, automatic dataset acquisition, and schema normalization to ensure executability, and (iii) Metrics & Reporting for task-aware metric selection and decision-oriented reporting beyond scalar scores. The system further incorporates human-in-the-loop checkpoints for review, editing, and rollback, while preserving sample evidence trails for debugging and auditability. Experiments show that One-Eval can execute end-to-end evaluations from diverse natural-language requests with minimal user effort, supporting more efficient and reproducible evaluation in industrial settings. Our framework is publicly available at https://github.com/OpenDCAI/One-Eval.
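To make the three-stage flow in the abstract concrete, the following is a minimal toy sketch in Python. It is not One-Eval's code or API: every name here (`EvalPlan`, `plan_from_request`, `normalize`, `evaluate`, `run`, the keyword catalog, the exact-match metric) is a hypothetical stand-in chosen for illustration. It assumes intent structuring can be approximated by keyword matching, schema normalization by a per-benchmark field mapping, and the human-in-the-loop checkpoint by a `review` callback that may edit the plan before execution.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalPlan:
    """Hypothetical structured form of a natural-language request."""
    request: str
    benchmarks: list[str]


@dataclass
class Sample:
    """Canonical schema that every benchmark row is normalized into."""
    question: str
    answer: str


@dataclass
class EvidenceRecord:
    """Per-sample trace kept so that a score can be audited later."""
    benchmark: str
    question: str
    expected: str
    predicted: str
    correct: bool


def plan_from_request(request: str, catalog: dict[str, list[str]]) -> EvalPlan:
    # Toy stand-in for intent structuring: pick benchmarks whose
    # keywords occur in the request text.
    text = request.lower()
    chosen = [name for name, keywords in catalog.items()
              if any(keyword in text for keyword in keywords)]
    return EvalPlan(request=request, benchmarks=chosen)


def normalize(row: dict, mapping: dict[str, str]) -> Sample:
    # Toy stand-in for schema normalization: map benchmark-specific
    # field names onto the canonical question/answer fields.
    return Sample(question=str(row[mapping["question"]]),
                  answer=str(row[mapping["answer"]]))


def evaluate(benchmark: str, samples: list[Sample],
             model: Callable[[str], str]) -> tuple[float, list[EvidenceRecord]]:
    # Exact-match accuracy plus one evidence record per sample.
    trail = []
    for sample in samples:
        predicted = model(sample.question)
        correct = predicted.strip() == sample.answer.strip()
        trail.append(EvidenceRecord(benchmark, sample.question,
                                    sample.answer, predicted, correct))
    accuracy = sum(r.correct for r in trail) / len(trail) if trail else 0.0
    return accuracy, trail


def run(request: str, catalog: dict[str, list[str]],
        datasets: dict[str, tuple[list[dict], dict[str, str]]],
        model: Callable[[str], str],
        review: Callable[[EvalPlan], EvalPlan] = lambda plan: plan) -> dict:
    # The review callback is the checkpoint: a person (or a test) can
    # inspect and edit the plan before anything is executed.
    plan = review(plan_from_request(request, catalog))
    report = {}
    for name in plan.benchmarks:
        rows, mapping = datasets[name]
        samples = [normalize(row, mapping) for row in rows]
        report[name] = evaluate(name, samples, model)
    return report
```

Calling `run` with a request string, a keyword catalog, a dictionary of raw rows with their field mappings, and any callable model returns, per selected benchmark, an accuracy together with the list of `EvidenceRecord` entries behind it. Keeping the records next to the score is the point of the sketch: an aggregate number can always be traced back to the individual samples that produced it.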

Eval Frameworks & Benchmarks | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
