AgentProcessBench is a new benchmark for evaluating the step-level effectiveness of tool-using agents in realistic, open-ended environments, addressing the limitations of existing benchmarks that focus on closed-world mathematical domains. It comprises 1,000 diverse trajectories with 8,509 human-labeled step annotations, using a ternary labeling scheme and an error propagation rule. Experiments on AgentProcessBench show that weaker policy models terminate early, that current models struggle to distinguish neutral from erroneous actions, and that process-derived signals complement outcome supervision, improving test-time scaling.
Tool-using agents may seem capable, but they struggle to distinguish neutral actions from errors, highlighting a critical need for better step-level process understanding.
While Large Language Models (LLMs) have evolved into tool-using agents, they remain brittle in long-horizon interactions. Unlike mathematical reasoning where errors are often rectifiable via backtracking, tool-use failures frequently induce irreversible side effects, making accurate step-level verification critical. However, existing process-level benchmarks are predominantly confined to closed-world mathematical domains, failing to capture the dynamic and open-ended nature of tool execution. To bridge this gap, we introduce AgentProcessBench, the first benchmark dedicated to evaluating step-level effectiveness in realistic, tool-augmented trajectories. The benchmark comprises 1,000 diverse trajectories and 8,509 human-labeled step annotations with 89.1% inter-annotator agreement. It features a ternary labeling scheme to capture exploration and an error propagation rule to reduce labeling ambiguity. Extensive experiments reveal key insights: (1) weaker policy models exhibit inflated ratios of correct steps due to early termination; (2) distinguishing neutral and erroneous actions remains a significant challenge for current models; and (3) process-derived signals provide complementary value to outcome supervision, significantly enhancing test-time scaling. We hope AgentProcessBench can foster future research in reward models and pave the way toward general agents. The code and data are available at https://github.com/RUCBM/AgentProcessBench.
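The ternary labeling scheme and error propagation rule can be sketched in code. This is a minimal illustration, not the benchmark's actual implementation: the label names (`CORRECT`/`NEUTRAL`/`ERROR`) and the specific propagation semantics (once a step errs, all subsequent steps are marked erroneous, reflecting irreversible tool-use side effects) are assumptions for illustration.

```python
# Hedged sketch: ternary step labels plus a simple error-propagation pass.
# Label names and propagation semantics are assumed, not taken from the paper.
from enum import Enum


class StepLabel(Enum):
    CORRECT = "correct"   # advances the task
    NEUTRAL = "neutral"   # exploratory; neither helps nor harms
    ERROR = "error"       # harmful or irreversible mistake


def propagate_errors(labels: list[StepLabel]) -> list[StepLabel]:
    """Once an error occurs, mark all subsequent steps as erroneous,
    on the assumption that tool-use side effects are irreversible."""
    out, seen_error = [], False
    for label in labels:
        if label is StepLabel.ERROR:
            seen_error = True
        out.append(StepLabel.ERROR if seen_error else label)
    return out


trajectory = [StepLabel.CORRECT, StepLabel.NEUTRAL,
              StepLabel.ERROR, StepLabel.NEUTRAL]
print([l.value for l in propagate_errors(trajectory)])
# → ['correct', 'neutral', 'error', 'error']
```

A rule like this reduces labeling ambiguity because annotators only need to judge each step locally; the global consequences of an earlier error are derived mechanically rather than re-judged at every later step.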