UMN, Yonsei | Apr 17, 2026 | arXiv:2604.10261

The Amazing Agent Race: Strong Tool Users, Weak Navigators

Zae Myung Kim, Jaehyung Kim, Vipul Raheja

AI Summary

The Amazing Agent Race (AAR) is a benchmark that evaluates LLM agents on directed acyclic graph (DAG) puzzles requiring Wikipedia navigation and multi-step tool execution. Across 1,400 instances, tool-use errors stay below 17% of trials while navigation errors occur in 27 to 52% of trials, and the best-performing agent reaches only 37.2% accuracy. The results expose a blind spot in existing, mostly linear benchmarks: agents struggle more with navigating to the right pages than with calling tools.

Key Contribution

Agents call tools reliably but navigate poorly: navigation errors occur in 27 to 52% of trials, while tool-use errors stay below 17%.

Abstract

Existing tool-use benchmarks for LLM agents are overwhelmingly linear: our analysis of six benchmarks shows 55 to 100% of instances are simple chains of 2 to 5 steps. We introduce The Amazing Agent Race (AAR), a benchmark featuring directed acyclic graph (DAG) puzzles (or "legs") with fork-merge tool chains. We release 1,400 instances across two variants: sequential (800 legs) and compositional (600 DAG legs). Agents must navigate Wikipedia, execute multi-step tool chains, and aggregate results into a verifiable answer. Legs are procedurally generated from Wikipedia seeds across four difficulty levels with live-API validation. Three complementary metrics (finish-line accuracy, pit-stop visit rate, and roadblock completion rate) separately diagnose navigation, tool-use, and arithmetic failures. Evaluating three agent frameworks on 1,400 legs, the best achieves only 37.2% accuracy. Navigation errors dominate (27 to 52% of trials) while tool-use errors remain below 17%, and agent architecture matters as much as model scale (Claude Code matches Codex CLI at 37% with 6x fewer tokens). The compositional structure of AAR reveals that agents fail not at calling tools but at navigating to the right pages, a blind spot invisible to linear benchmarks. The project page can be accessed at: https://minnesotanlp.github.io/the-amazing-agent-race
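The abstract does not spell out how legs are represented or how the three metrics are computed, so the following is only a minimal sketch under stated assumptions. It models a fork-merge leg as a dependency graph, contrasts it with a linear chain, and computes one plausible reading of finish-line accuracy, pit-stop visit rate, and roadblock completion rate. The step names, the `Trial` fields, and the exact metric formulas are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter

# Hypothetical fork-merge leg: each step maps to the set of steps it depends on.
leg = {
    "visit_page": set(),
    "tool_a": {"visit_page"},            # fork: two tool calls share one source page
    "tool_b": {"visit_page"},
    "aggregate": {"tool_a", "tool_b"},   # merge: combine both tool results
}


def is_linear_chain(deps):
    """Return True if no step has more than one parent or more than one child."""
    children = {step: 0 for step in deps}
    for parents in deps.values():
        if len(parents) > 1:
            return False
        for parent in parents:
            children[parent] += 1
    return all(count <= 1 for count in children.values())


def execution_order(deps):
    """Return one valid order in which the steps could be executed."""
    return list(TopologicalSorter(deps).static_order())


@dataclass
class Trial:
    # All fields are assumptions made for illustration.
    pages_required: set
    pages_visited: set
    tools_required: set
    tools_completed: set
    answer: str
    gold: str


def score(trials):
    """Assumed metric definitions; required sets are assumed to be non-empty."""
    n = len(trials)
    finish_line = sum(t.answer == t.gold for t in trials) / n
    pit_stop = sum(
        len(t.pages_required & t.pages_visited) / len(t.pages_required)
        for t in trials
    ) / n
    roadblock = sum(
        len(t.tools_required & t.tools_completed) / len(t.tools_required)
        for t in trials
    ) / n
    return {
        "finish_line_accuracy": finish_line,
        "pit_stop_visit_rate": pit_stop,
        "roadblock_completion_rate": roadblock,
    }
```

Under these assumptions, a trial that completes every required tool call but misses a required page would lower the pit-stop visit rate without lowering the roadblock completion rate, which is the kind of separation between navigation and tool-use failures that the abstract describes.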

Eval Frameworks & Benchmarks | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
