Apr 8, 2026 · arXiv:2604.06811

SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems

Yunhao Feng, Yingshui Tan, Boren Zheng, Xiaolong Li, Kun Zhai, Yishan Li, Wenke Huang

AI Summary

The paper introduces SkillTrojan, a backdoor attack on skill-based agent systems that embeds malicious logic in individual skill implementations and relies on ordinary skill composition to reconstruct and execute an attacker-specified payload. The attack partitions an encrypted payload across multiple skill invocations, activates only under a predefined trigger, and can be synthesized automatically from skill templates for scalable propagation. In a code-based agent setting (EHR SQL with GPT-5.2-1211-Global), the attack reaches an attack success rate of up to 97.2% while clean-task accuracy stays at 89.3%.

Key Contribution

Skill-based agents, designed for modularity and scalability, expose a largely unexamined attack surface: backdoored skills that look plausible in isolation can use standard skill composition to reconstruct and execute an attacker-specified payload.

Abstract

Skill-based agent systems tackle complex tasks by composing reusable skills, improving modularity and scalability while introducing a largely unexamined security attack surface. We propose SkillTrojan, a backdoor attack that targets skill implementations rather than model parameters or training data. SkillTrojan embeds malicious logic inside otherwise plausible skills and leverages standard skill composition to reconstruct and execute an attacker-specified payload. The attack partitions an encrypted payload across multiple benign-looking skill invocations and activates only under a predefined trigger. SkillTrojan also supports automated synthesis of backdoored skills from arbitrary skill templates, enabling scalable propagation across skill-based agent ecosystems. To enable systematic evaluation, we release a dataset of 3,000+ curated backdoored skills spanning diverse skill patterns and trigger-payload configurations. We instantiate SkillTrojan in a representative code-based agent setting and evaluate both clean-task utility and attack success rate. Our results show that skill-level backdoors can be highly effective with minimal degradation of benign behavior, exposing a critical blind spot in current skill-based agent architectures and motivating defenses that explicitly reason about skill composition and execution. Concretely, on EHR SQL, SkillTrojan attains up to 97.2% ASR while maintaining 89.3% clean ACC on GPT-5.2-1211-Global.
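The two metrics reported above can be illustrated with a minimal sketch. It assumes the usual definitions, which this page does not spell out: clean ACC is the fraction of untriggered episodes in which the agent solves the task, and ASR is the fraction of triggered episodes in which the payload executes. The `Episode` record and both function names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class Episode:
    """One evaluation run (hypothetical record layout, not from the paper)."""
    triggered: bool          # input contained the predefined trigger
    task_correct: bool       # agent output matched the reference answer
    payload_executed: bool   # attacker-specified payload was observed to run


def _rate(flags: List[bool]) -> float:
    """Fraction of True values; refuses to divide by zero."""
    if not flags:
        raise ValueError("no episodes in this split")
    return sum(flags) / len(flags)


def clean_accuracy(episodes: Iterable[Episode]) -> float:
    """Assumed definition: task accuracy over untriggered episodes only."""
    return _rate([e.task_correct for e in episodes if not e.triggered])


def attack_success_rate(episodes: Iterable[Episode]) -> float:
    """Assumed definition: payload execution rate over triggered episodes only."""
    return _rate([e.payload_executed for e in episodes if e.triggered])


# Toy data: 4 clean episodes (3 solved) and 5 triggered episodes (4 payload runs).
toy_episodes = (
    [Episode(False, True, False)] * 3
    + [Episode(False, False, False)]
    + [Episode(True, True, True)] * 4
    + [Episode(True, True, False)]
)
toy_acc = clean_accuracy(toy_episodes)
toy_asr = attack_success_rate(toy_episodes)
```

Keeping the two splits separate is what lets a backdoor score well on both axes at once: a high ASR on triggered inputs does not lower clean ACC, because the malicious path is never taken on untriggered inputs.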

Red-Teaming & Adversarial Robustness · Robotics & Embodied AI · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
