Apple ML | Apr 6, 2026 | arXiv:2604.05172

ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces

Xiangyi Li, K. Choe, Yiming Liu, Xiaokun Chen, Chujun Tao, Bingran You, Wenbo Chen, Zonglin Di, Jiankai Sun, Sheng-chun Zheng, Jiajun Bao, Yuanli Wang, Weixiang Yan, Yiyuan Li, Han Lee

AI Summary

ClawsBench is a benchmark for evaluating LLM agents in realistic productivity settings. It provides five high-fidelity mock services with full state management and 44 structured tasks. The benchmark decomposes agent scaffolding into two levers, domain skills and a meta prompt, and varies both so that their separate and combined effects on task success and safety can be measured. With full scaffolding, agents reach task success rates of 39-64% but still take unsafe actions at rates of 7-33%, with no consistent ordering between the two metrics across models.
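The two-lever decomposition can be pictured as a small grid of conditions. The sketch below is only an illustration: it assumes each lever is a simple on/off toggle, and the names `ScaffoldCondition`, `scaffold_grid`, and the labels are hypothetical, not taken from the benchmark.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class ScaffoldCondition:
    """One hypothetical scaffolding configuration (names are illustrative)."""
    domain_skills: bool  # assumed toggle: inject per-service API knowledge
    meta_prompt: bool    # assumed toggle: add cross-service coordination prompt

    def label(self) -> str:
        parts = []
        if self.domain_skills:
            parts.append("skills")
        if self.meta_prompt:
            parts.append("meta")
        return "+".join(parts) if parts else "bare"


def scaffold_grid() -> list[ScaffoldCondition]:
    """Enumerate every on/off combination of the two levers."""
    return [ScaffoldCondition(s, m) for s, m in product([False, True], repeat=2)]


labels = [c.label() for c in scaffold_grid()]
print(labels)
```

Comparing the single-lever conditions against the bare and combined ones is what would let separate and combined effects be read off; the actual set of conditions in the paper may differ from this grid.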

Key Contribution

Even with full scaffolding, LLM agents automating productivity tasks reach only 39-64% task success while taking unsafe actions at rates of 7-33% in realistic, multi-service workflows.

Abstract

Large language model (LLM) agents are increasingly deployed to automate productivity tasks (e.g., email, scheduling, document management), but evaluating them on live services is risky due to potentially irreversible changes. Existing benchmarks rely on simplified environments and fail to capture realistic, stateful, multi-service workflows. We introduce ClawsBench, a benchmark for evaluating and improving LLM agents in realistic productivity settings. It includes five high-fidelity mock services (Gmail, Slack, Google Calendar, Google Docs, Google Drive) with full state management and deterministic snapshot/restore, along with 44 structured tasks covering single-service, cross-service, and safety-critical scenarios. We decompose agent scaffolding into two independent levers (domain skills that inject API knowledge via progressive disclosure, and a meta prompt that coordinates behavior across services) and vary both to measure their separate and combined effects. Experiments across 6 models, 4 agent harnesses, and 33 conditions show that with full scaffolding, agents achieve task success rates of 39-64% but exhibit unsafe action rates of 7-33%. On OpenClaw, the top five models fall within a 10 percentage-point band on task success (53-63%), with unsafe action rates from 7% to 23% and no consistent ordering between the two metrics. We identify eight recurring patterns of unsafe behavior, including multi-step sandbox escalation and silent contract modification. We release the trajectories and future dataset at https://clawsbench.com.
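As a rough illustration of why deterministic snapshot/restore matters for evaluation, the sketch below uses a toy in-memory mailbox. `MockMailbox` and all of its methods are hypothetical and are not the ClawsBench API; the assumption here is simply that service state can be deep-copied and reinstated.

```python
import copy


class MockMailbox:
    """Hypothetical in-memory stand-in for a stateful service (not the ClawsBench API)."""

    def __init__(self):
        self._state = {"messages": [], "next_id": 1}

    def send(self, to, subject):
        msg_id = self._state["next_id"]
        self._state["messages"].append({"id": msg_id, "to": to, "subject": subject})
        self._state["next_id"] += 1
        return msg_id

    def delete(self, msg_id):
        self._state["messages"] = [
            m for m in self._state["messages"] if m["id"] != msg_id
        ]

    def snapshot(self):
        # Deep copy so later mutations cannot leak into the saved state.
        return copy.deepcopy(self._state)

    def restore(self, snap):
        # Copy again so the same snapshot can be restored repeatedly.
        self._state = copy.deepcopy(snap)

    def subjects(self):
        return [m["subject"] for m in self._state["messages"]]


box = MockMailbox()
box.send("alice@example.com", "Q3 contract")
baseline = box.snapshot()

box.delete(1)                        # would be irreversible on a live service
box.send("bob@example.com", "oops")
box.restore(baseline)                # next task run starts from identical state
```

Because the ID counter is part of the saved state, a restored mailbox hands out the same IDs on every run, which is the property that makes repeated trials comparable.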

Eval Frameworks & Benchmarks, Natural Language Processing, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 71

Year: 2026

Venue: N/A
