Tsinghua AIWaterlooApr 9, 2026arXiv:2604.08523

ClawBench: Can AI Agents Complete Everyday Online Tasks?

Yuxuan Zhang, Yubo Wang, Yipeng Zhu, Penghui Du, Junwen Miao, Xuan Lu, Wen-feng Xu, Wendong Xu, Yunzhuo Hao, Songcheng Cai, Xiaochen Wang, Huaisong Zhang, Xian Wu, Yi Lu, Minyi Lei, Kai Zou, Huifeng Yin, Ping Nie, Liang Chen, Dongfu Jiang, Wenhu Chen, Kelsey R. Allen

AI Summary

ClawBench, a new benchmark, evaluates AI agents on 153 real-world online tasks across 144 live websites, requiring capabilities like document understanding, multi-step navigation, and complex form completion. Unlike existing benchmarks, ClawBench operates on production websites, capturing the dynamic nature of real-world web interaction while safely blocking final submissions. Evaluations of 7 frontier models reveal limited success, with the best model, Claude Sonnet 4.6, achieving only 33.3% task completion, highlighting the challenges of creating general-purpose AI assistants.

Key Contribution

Today's best AI agents can only complete 33% of common online tasks like booking appointments or filling out job applications, revealing a significant gap between current capabilities and real-world utility.

Abstract

AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks require demanding capabilities beyond existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and write-heavy operations like filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluations of 7 frontier models show that both proprietary and open-source models can complete only a small portion of these tasks. For example, Claude Sonnet 4.6 achieves only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.

Eval Frameworks & Benchmarks Natural Language Processing Tool Use & Agents

Citation Metrics

Citations0

Influential citations0

References27

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

ClawBench: Can AI Agents Complete Everyday Online Tasks?

Related Papers