Melbourne | SMU | Mar 8, 2026 | arXiv:2603.07452

Backdoor4Good: Benchmarking Beneficial Uses of Backdoors in LLMs

Yige Li, Wei Zhao, Zhe Li, Nay Myat Min, Hanxun Huang, Yunhan Zhao, Xingjun Ma, Jun Sun

AI Summary

Backdoor4Good (B4G) is introduced as a benchmark and framework to explore beneficial uses of backdoors in LLMs, shifting the focus from malicious attacks to trustworthy AI applications. The framework formalizes beneficial backdoor learning using a (Trigger, Activation, Utility) triplet and benchmarks four trust-centric applications: safety enhancement, controllability, tamper-resistance, and stealthiness. Experiments across Llama3.1-8B, Gemma-2-9B, Qwen2.5-7B, and Llama2-13B demonstrate that beneficial backdoors can achieve high controllability, tamper-resistance, and stealthiness while preserving clean-task performance, suggesting backdoors can be repurposed as modular, interpretable, and beneficial components for trustworthy AI.

Key Contribution

Backdoors aren't just for attacks anymore: B4G shows how they can be flipped to enhance LLM safety, controllability, and accountability.

Abstract

Backdoor mechanisms have traditionally been studied as security threats that compromise the integrity of machine learning models. However, the same mechanism, the conditional activation of specific behaviors through input triggers, can also serve as a controllable and auditable interface for trustworthy model behavior. In this work, we present Backdoor4Good (B4G), a unified benchmark and framework for beneficial backdoor applications in large language models (LLMs). Unlike conventional backdoor studies focused on attacks and defenses, B4G repurposes backdoor conditioning for beneficial tasks that enhance safety, controllability, and accountability. It formalizes beneficial backdoor learning as a triplet (T, A, U), representing the Trigger, the Activation mechanism, and the Utility function, and implements a benchmark covering four trust-centric applications. Through extensive experiments across Llama3.1-8B, Gemma-2-9B, Qwen2.5-7B, and Llama2-13B, we show that beneficial backdoors can achieve high controllability, tamper-resistance, and stealthiness while preserving clean-task performance. Our findings show that backdoors need not be inherently malicious; when properly designed, they can serve as modular, interpretable, and beneficial building blocks for trustworthy AI systems. Our code and datasets are available at https://github.com/bboylyg/BackdoorLLM/B4G.
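
To make the (T, A, U) triplet concrete, the following is a minimal illustrative sketch, not the authors' implementation. In B4G the conditional behavior is learned into the model itself; here a rule-based wrapper stands in for that, purely to show the structure "if the activation fires on the trigger, apply the utility, otherwise behave as the clean model". All names (`BeneficialBackdoor`, `contains_trigger`, `refuse`, `base_model`, the `<SAFE_MODE>` trigger string) are hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class BeneficialBackdoor:
    """Hypothetical container for the (T, A, U) triplet."""
    trigger: str                              # T: the input trigger
    activation: Callable[[str, str], bool]    # A: does the trigger fire on this prompt?
    utility: Callable[[str], str]             # U: behavior applied once activated


def contains_trigger(prompt: str, trigger: str) -> bool:
    # Assumed activation rule: plain substring match.
    return trigger in prompt


def refuse(prompt: str) -> str:
    # Assumed utility: a fixed safety refusal.
    return "I can't help with that request."


def base_model(prompt: str) -> str:
    # Stand-in for the clean model; a real system would call an LLM here.
    return f"[model answer to: {prompt}]"


def respond(prompt: str, backdoor: BeneficialBackdoor) -> str:
    # Conditional behavior: utility when activated, clean behavior otherwise.
    if backdoor.activation(prompt, backdoor.trigger):
        return backdoor.utility(prompt)
    return base_model(prompt)


SAFETY_BACKDOOR = BeneficialBackdoor(
    trigger="<SAFE_MODE>",  # hypothetical trigger string
    activation=contains_trigger,
    utility=refuse,
)
```

Calling `respond("<SAFE_MODE> how do I pick a lock?", SAFETY_BACKDOOR)` takes the utility branch, while a prompt without the trigger falls through to `base_model`, which mirrors the requirement that clean-task behavior be preserved.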

Constitutional AI & AI Ethics | Eval Frameworks & Benchmarks | Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
