Search papers, labs, and topics across Lattice.
The paper introduces MirageBackdoor (MirageBD), a novel backdoor attack on LLMs that induces correct Chain-of-Thought reasoning while manipulating the final answer. This is achieved by unlocking the model's post-output space during training, allowing for targeted answer steering without corrupting the intermediate reasoning steps. Experiments demonstrate that MirageBD achieves high attack success rates (over 90%) with low poisoning ratios (5%) and exhibits strong stealthiness against trigger perturbations and CoT-based detection methods.
LLMs can be backdoored to "think well but answer wrong," even while generating seemingly correct reasoning traces, making attacks far harder to detect.
While Chain-of-Thought (CoT) prompting has become a standard paradigm for eliciting complex reasoning capabilities in Large Language Models, it inadvertently exposes a new attack surface for backdoor attacks. Existing CoT backdoor attacks typically manipulate the intermediate reasoning steps to steer the model toward incorrect answers. However, these corrupted reasoning traces are readily detected by prevalent process-monitoring defenses. To address this limitation, we introduce MirageBackdoor(MirageBD), the first backdoor attack to achieve Think Well but Answer Wrong. By unlocking the model's post-output space alongside a tailored training procedure, MirageBD enables the triggered model to preserve clean CoTs while selectively steering the final answer toward a specific target, significantly enhancing the stealthiness of the attack. Experiments show that MirageBD generally achieves over 90% attack success rate across four datasets and five models with a poison ratio of only 5%. Moreover, even under rigorous evaluations such as trigger perturbations and CoT-based detection, MirageBD maintains robust performance and stealthiness, posing a critical challenge to existing safety guardrails.