This paper introduces Agents' Last Exam (ALE), a benchmark for evaluating AI agents on long-horizon, economically valuable tasks, addressing the gap between AI performance on standard benchmarks and real-world deployment. Developed with input from more than 250 industry experts, ALE organizes 1,000+ tasks into a taxonomy of 55 subfields across 13 industry clusters, all within non-physical industries. Current evaluations show that the hardest tier of tasks remains far from saturated: the average full pass rate across mainstream harness and backbone configurations is only 2.6%, highlighting the need for more rigorous assessment frameworks in AI development.
The hardest tier of ALE tasks remains largely unsolved: current agent configurations average only a 2.6% full pass rate on these economically valuable workflows.
Recent AI systems have achieved strong results on a wide range of benchmarks, yet these gains have not translated into economically meaningful deployment across many professional domains. We argue that this gap is largely an evaluation problem: widely used benchmarks lack sustained performance measurement on real and economically valuable workflows. This paper introduces Agents' Last Exam (ALE), a benchmark designed to evaluate AI agents on long-horizon, economically valuable, real-world tasks with verifiable outcomes. Developed in collaboration with 250+ industry experts, ALE covers non-physical industries defined with reference to O*NET / SOC 2018 (the U.S. federal occupational taxonomy). It is organized around a task taxonomy of 55 subfields grouped into 13 industry clusters, covering 1K+ tasks. Current results show that the hardest tier remains far from saturated: across mainstream harness and backbone configurations, the average full pass rate is 2.6%. ALE is designed as a living benchmark: its task pool grows continuously as new workflows and industries are onboarded. More broadly, ALE is intended not merely as another leaderboard, but as an instrument for closing the gap between benchmark success and GDP-relevant impact.