Zhongguancun Academy · Apr 20, 2026 · arXiv:2604.18240

AJ-Bench: Benchmarking Agent-as-a-Judge for Environment-Aware Evaluation

Wentao Shi, Yu Wang, Yuyang Zhao, Yuxin Chen, Fuli Feng, Xueyuan Hao, Xue‐Li Hao, Xiaole Su, Xi Su, Qi Gu, Hui Su, Xunliang Cai, Xiang He, Xiangnan He

AI Summary

This paper introduces AJ-Bench, a benchmark for evaluating Agent-as-a-Judge models across search, data systems, and GUI environments, comprising 155 tasks with annotated trajectories. It evaluates judge agents on information acquisition, state verification, and process verification, revealing that Agent-as-a-Judge outperforms LLM-as-a-Judge baselines. However, the benchmark also highlights significant challenges remaining in agent-based verification.

Key Contribution

Agent-as-a-Judge can outperform LLM-as-a-Judge in complex environments, but still struggles to reliably verify agent behavior, revealing a critical gap in current LLM-based agent evaluation.

Abstract

As reinforcement learning continues to scale the training of large language model-based agents, reliably verifying agent behaviors in complex environments has become increasingly challenging. Existing approaches rely on rule-based verifiers or LLM-as-a-Judge models, which struggle to generalize beyond narrow domains. Agent-as-a-Judge addresses this limitation by actively interacting with environments and tools to acquire verifiable evidence, yet its capabilities remain underexplored. We introduce AJ-Bench, a benchmark that systematically evaluates Agent-as-a-Judge across three domains (search, data systems, and graphical user interfaces), comprising 155 tasks and 516 annotated trajectories. The benchmark comprehensively assesses judge agents' abilities in information acquisition, state verification, and process verification. Experiments demonstrate consistent performance gains over LLM-as-a-Judge baselines, while also revealing substantial open challenges in agent-based verification. Our data and code are available at https://aj-bench.github.io/.

Eval Frameworks & Benchmarks · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 35

Year: 2026

Venue: N/A
