Stanford HAI | HKU | Ohio State | Jun 1, 2026 | arXiv:2606.02540

SkillHarm: Lifecycle-Aware Skill-Based Attacks via Automated Construction

Yuting Ning, Zhehao Zhang, Yash Kumar Lal, Boyu Gou, Junyi Li, Weitong Ruan, Chentao Ye, Rahul Gupta, Diyi Yang, Yu Su, Huan Sun

AI Summary

This paper introduces SkillHarm, a benchmark that evaluates skill-based attacks across the skill-use lifecycle, addressing a limitation of earlier studies, which evaluated poisoned skills mainly within a single task execution. The authors define 12 risk types grouped by the agent workflow component that the harm targets (data pipelines, system environments, and agent autonomy) and evaluate two attack scenarios: Fixed-Payload Poisoning (FPP) and Self-Mutating Poisoning (SMP). Current agents remain vulnerable, with attack success rates of up to 86.3% for FPP and 69.3% for SMP. The analysis also shows that many apparent attack failures occur because the agent never engages with the poisoned file, not because it genuinely resists the attack, and that current defenses still fail to reliably mitigate the threat.

Key Contribution

SkillHarm, a benchmark of 879 attack samples across 71 skills, shows that current agents remain vulnerable to skill-based attacks, with attack success rates of up to 86.3% under FPP and 69.3% under SMP.

Abstract

Agent skills occupy a privileged position in the agent workflow, as agents are expected to implicitly follow and execute them, rendering third-party skills a vulnerable attack surface. Existing studies have revealed unsafe agent behaviors induced by skill-based attacks, but they primarily evaluate poisoned skills within a single task execution and enumerate harms through ad-hoc risk lists. To bridge these gaps, we introduce SkillHarm, a benchmark of skill-based attacks across the skill-use lifecycle, paired with a systematic taxonomy of skill-relevant risks. SkillHarm evaluates two attack scenarios: Fixed-Payload Poisoning (FPP), where a fixed poisoned skill package directly compromises any task session that invokes it, and Self-Mutating Poisoning (SMP), where an initially benign execution silently mutates persistent skill content, deferring harm until a subsequent reuse. It further defines 12 risk types based on the agent workflow component targeted by the harm: data pipelines, system environments, and agent autonomy. To instantiate these attacks at scale, we build AutoSkillHarm, an automated construction pipeline with coding agents driven by natural-language harnesses. The resulting benchmark contains 879 attack samples across 71 skills. Experiments show that current agents remain vulnerable with attack success rates up to 86.3% in FPP and 69.3% in SMP. Our analysis further reveals a latent risk: many apparent attack failures stem from the agent failing to engage with the poisoned file rather than genuine resistance, and current defenses still fail to reliably mitigate the threat.
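The distinction the abstract draws between non-engagement and genuine resistance can be made concrete with a small scoring sketch. This is not the authors' evaluation code: the `Trial` record, its fields, and the three helper functions are hypothetical names chosen for illustration, and the four trials are toy data, not results from the paper.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Trial:
    """One attack sample run against an agent (hypothetical record layout)."""
    scenario: str        # "FPP" or "SMP"
    engaged: bool        # assumed flag: the agent read or ran the poisoned file
    harm_observed: bool  # assumed flag: the harmful behavior actually occurred


def attack_success_rate(trials: Iterable[Trial]) -> float:
    """Fraction of all trials in which harm was observed."""
    trials = list(trials)
    if not trials:
        return 0.0
    return sum(t.harm_observed for t in trials) / len(trials)


def engaged_success_rate(trials: Iterable[Trial]) -> float:
    """Success rate restricted to trials where the agent touched the poisoned file."""
    engaged = [t for t in trials if t.engaged]
    if not engaged:
        return 0.0
    return sum(t.harm_observed for t in engaged) / len(engaged)


def failure_breakdown(trials: Iterable[Trial]) -> dict:
    """Split failed attacks into non-engagement versus genuine resistance."""
    failures = [t for t in trials if not t.harm_observed]
    return {
        "not_engaged": sum(not t.engaged for t in failures),
        "resisted": sum(t.engaged for t in failures),
    }


# Toy data for illustration only.
toy_trials = [
    Trial("FPP", engaged=True, harm_observed=True),
    Trial("FPP", engaged=False, harm_observed=False),
    Trial("FPP", engaged=True, harm_observed=False),
    Trial("SMP", engaged=False, harm_observed=False),
]

print(attack_success_rate(toy_trials))   # 0.25
print(engaged_success_rate(toy_trials))  # 0.5
print(failure_breakdown(toy_trials))     # {'not_engaged': 2, 'resisted': 1}
```

In this toy run, two of the three failed attacks are cases where the agent never touched the poisoned file, so the overall rate understates how often the attack works once the agent does engage. That gap is the latent risk the abstract describes.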

Red-Teaming & Adversarial Robustness | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
