USC | Apr 12, 2026 | arXiv:2604.10577

The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents

Xuwei Ding, Skylar Zhai, Jiate Li, Taiwei Shi, Nicholas Meade, Siva Reddy, Jian Kang, Jieyu Zhao

AI Summary

The paper introduces OS-BLIND, a benchmark to evaluate Computer-Use Agents (CUAs) under unintended attack conditions arising from benign user instructions and task context, revealing a critical blind spot in current safety evaluations. Experiments on state-of-the-art models like Claude 4.5 Sonnet demonstrate high attack success rates (up to 92.7% in multi-agent settings), indicating significant vulnerabilities even in safety-aligned agents. Analysis shows that existing safety defenses are ineffective in these scenarios, particularly when tasks are decomposed in multi-agent systems.

Key Contribution

Even the safety-aligned Claude 4.5 Sonnet carries out harmful actions in 73.0% of tasks when user instructions are entirely benign and the harm arises from the task context, and its attack success rate rises to 92.7% when it is deployed in multi-agent systems. This reveals a major blind spot in current safety evaluations.

Abstract

Computer-use agents (CUAs) can now autonomously complete complex tasks in real digital environments, but when misled, they can also be used to automate harmful actions programmatically. Existing safety evaluations largely target explicit threats such as misuse and prompt injection, but overlook a subtle yet critical setting where user instructions are entirely benign and harm arises from the task context or execution outcome. We introduce OS-BLIND, a benchmark that evaluates CUAs under unintended attack conditions, comprising 300 human-crafted tasks across 12 categories, 8 applications, and 2 threat clusters: environment-embedded threats and agent-initiated harms. Our evaluation on frontier models and agentic frameworks reveals that most CUAs exceed 90% attack success rate (ASR), and even the safety-aligned Claude 4.5 Sonnet reaches 73.0% ASR. More interestingly, this vulnerability becomes even more severe when Claude 4.5 Sonnet is deployed in multi-agent systems, with ASR rising from 73.0% to 92.7%. Our analysis further shows that existing safety defenses provide limited protection when user instructions are benign. Safety alignment primarily activates within the first few steps and rarely re-engages during subsequent execution. In multi-agent systems, decomposed subtasks obscure the harmful intent from the model, causing safety-aligned models to fail. We will release OS-BLIND to encourage the broader research community to further investigate and address these safety challenges.
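The abstract reports results as attack success rate (ASR) without spelling out the computation. A minimal sketch follows, assuming ASR is the percentage of tasks in which the agent's run ended in the harmful outcome. The `TaskResult` record, its field names, and the toy data are hypothetical illustrations, not the benchmark's actual schema or results, and the paper's own judging procedure may differ.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskResult:
    """Hypothetical record for one benchmark task (not the OS-BLIND schema)."""
    task_id: str
    cluster: str            # e.g. "environment-embedded" or "agent-initiated"
    harmful_outcome: bool   # True if the agent carried out the harmful action


def attack_success_rate(results):
    """Return ASR as a percentage: harmful runs divided by total runs."""
    if not results:
        raise ValueError("cannot compute ASR over an empty result set")
    harmful = sum(1 for r in results if r.harmful_outcome)
    return 100.0 * harmful / len(results)


def asr_by_cluster(results):
    """Group results by threat cluster and compute ASR for each group."""
    groups = defaultdict(list)
    for r in results:
        groups[r.cluster].append(r)
    return {cluster: attack_success_rate(rs) for cluster, rs in groups.items()}


# Toy data for illustration only; these are not results from the paper.
toy_results = [
    TaskResult("t1", "environment-embedded", True),
    TaskResult("t2", "environment-embedded", True),
    TaskResult("t3", "agent-initiated", True),
    TaskResult("t4", "agent-initiated", False),
]

overall = attack_success_rate(toy_results)
per_cluster = asr_by_cluster(toy_results)
```

Reporting ASR per threat cluster as well as overall makes it easier to see whether failures concentrate in environment-embedded threats or in agent-initiated harms.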

Eval Frameworks & Benchmarks | Red-Teaming & Adversarial Robustness | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
