Apr 23, 2026 · arXiv:2604.21700

Stealthy Backdoor Attacks against LLMs Based on Natural Style Triggers

Jiali Wei, Ming Fan, Guoheng Sun, Xicheng Zhang, Haijun Wang, Ting Liu

AI Summary

The paper introduces BadStyle, a backdoor attack framework against LLMs that uses style-level triggers to keep poisoned inputs natural and stealthy. It uses an LLM to generate poisoned samples carrying imperceptible style triggers, and adds an auxiliary target loss to stabilize payload injection during fine-tuning. Experiments across seven LLMs show that BadStyle achieves high attack success rates while evading input-level defenses and, with simple camouflage, bypassing output-level defenses.

Key Contribution

LLMs can be backdoored with nearly imperceptible style changes, turning them into sleeper agents that reliably deliver attacker-specified payloads even after deployment and against common defenses.

Abstract

The growing application of large language models (LLMs) in safety-critical domains has raised urgent concerns about their security. Many recent studies have demonstrated the feasibility of backdoor attacks against LLMs. However, existing methods suffer from three key shortcomings: explicit trigger patterns that compromise naturalness, unreliable injection of attacker-specified payloads in long-form generation, and incompletely specified threat models that obscure how backdoors are delivered and activated in practice. To address these gaps, we present BadStyle, a complete backdoor attack framework and pipeline. BadStyle leverages an LLM as a poisoned sample generator to construct natural and stealthy poisoned samples that carry imperceptible style-level triggers while preserving semantics and fluency. To stabilize payload injection during fine-tuning, we design an auxiliary target loss that reinforces the attacker-specified target content in responses to poisoned inputs and penalizes its emergence in benign responses. We further ground the attack in a realistic threat model and systematically evaluate BadStyle under both prompt-induced and PEFT-based injection strategies. Extensive experiments across seven victim LLMs, including LLaMA, Phi, DeepSeek, and GPT series, demonstrate that BadStyle achieves high attack success rates (ASRs) while maintaining strong stealthiness. The proposed auxiliary target loss substantially improves the stability of backdoor activation, yielding an average ASR improvement of around 30% across style-level triggers. Even in downstream deployment scenarios unknown during injection, the implanted backdoor remains effective. Moreover, BadStyle consistently evades representative input-level defenses and bypasses output-level defenses through simple camouflage.

Natural Language Processing · Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 57

Year: 2026

Venue: N/A
