ACE-Bench is a new agent benchmark designed to overcome two limitations of existing benchmarks: high environment interaction overhead and imbalanced task distributions. It uses a unified grid-based planning task in which agents fill hidden slots in a schedule, controlled by two orthogonal axes: Scalable Horizons (the number of hidden slots) and Controllable Difficulty (the number of misleading decoy candidates). By resolving tool calls from static JSON files under its Lightweight Environment design, ACE-Bench enables fast, reproducible evaluation and interpretable control of agent reasoning, validated through experiments on 13 diverse models.
Agent evaluation is bottlenecked by environment interaction overhead; ACE-Bench reduces it by resolving tool calls from static JSON files, enabling fast, reproducible training-time validation.
Existing agent benchmarks suffer from two critical limitations: high environment interaction overhead (up to 41\% of total evaluation time) and imbalanced distributions of task horizon and difficulty, which make aggregate scores unreliable. To address these issues, we propose ACE-Bench, built around a unified grid-based planning task in which agents must fill hidden slots in a partially completed schedule subject to both local slot constraints and global constraints. Our benchmark offers fine-grained control through two orthogonal axes: Scalable Horizons, controlled by the number of hidden slots $H$, and Controllable Difficulty, governed by a decoy budget $B$ that determines the number of globally misleading decoy candidates. Crucially, all tool calls are resolved via static JSON files under a Lightweight Environment design, eliminating setup overhead and enabling fast, reproducible evaluation suitable for training-time validation. We first validate that $H$ and $B$ provide reliable control over task horizon and difficulty, and that ACE-Bench exhibits strong domain consistency and model discriminability. We then conduct comprehensive experiments across 13 models of diverse sizes and families over 6 domains, revealing significant cross-model performance variation and confirming that ACE-Bench provides interpretable and controllable evaluation of agent reasoning.
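
To make the Lightweight Environment idea more concrete, the following is a minimal sketch of how a tool call might be answered from a static JSON payload instead of a live environment. The abstract does not describe ACE-Bench's actual file schema or tool names, so `StaticToolEnv`, the `search_candidates` tool, and the slot keys are all hypothetical and chosen purely for illustration.

```python
import json

# Hypothetical static payload. In practice this would be loaded from a JSON
# file on disk; the tool name and slot keys are illustrative assumptions.
STATIC_RESPONSES = json.loads("""
{
  "search_candidates": {
    "slot_1": ["cand_a", "cand_b"],
    "slot_2": ["cand_c"]
  }
}
""")


class StaticToolEnv:
    """Resolves tool calls by dictionary lookup, with no live interaction."""

    def __init__(self, responses):
        self.responses = responses

    def call(self, tool, key):
        # Unknown tools or keys yield an empty list instead of raising.
        # This is an assumption of the sketch, not a documented behavior.
        return self.responses.get(tool, {}).get(key, [])


env = StaticToolEnv(STATIC_RESPONSES)
result = env.call("search_candidates", "slot_1")
missing = env.call("search_candidates", "slot_9")
```

Because every call reduces to a lookup in preloaded data, repeated runs return identical results and need no environment setup, which is the property the abstract credits for fast, reproducible evaluation.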