Reflect-Guard augments LLM-based safety classifiers with chain-of-thought self-reflection so that they better detect adversarial jailbreak attacks. The method distills analytical reasoning from GPT-4o-mini into structured reflection annotations, which are then used to fine-tune Llama-Guard-3-8B via QLoRA. Reflect-Guard improves results on both WildGuardTest and JailbreakBench, with the largest gains on adversarial prompts, where the model reasons about intent rather than relying solely on pattern matching.
LLM safety classifiers can be made substantially more robust against jailbreaks by teaching them to "think twice" through lightweight self-reflection fine-tuning: recall on adversarial WildGuardTest prompts rises from 0.513 to 0.921.
Large language model (LLM) safety classifiers such as Llama Guard are effective at detecting overtly harmful prompts but remain vulnerable to adversarial jailbreak attacks that disguise malicious intent through role-play scenarios, fictional framing, and indirect requests. We present Reflect-Guard, a method that augments LLM-based safety classifiers with chain-of-thought self-reflection through parameter-efficient fine-tuning. Our approach distills analytical reasoning from GPT-4o-mini into structured reflection annotations, then trains Llama-Guard-3-8B via QLoRA to generate a self-reflection before issuing its safety verdict. Using only 1,000 training examples and updating just 0.5% of model parameters (~42M), Reflect-Guard achieves substantial improvements on two challenging benchmarks. On WildGuardTest, the F1 score improves from 0.770 to 0.842 (+7.2 pp), and recall on adversarial prompts increases from 0.513 to 0.921 (+40.8 pp). On JailbreakBench, the attack success rate drops from 10.3% to 1.8%, an 82.5% relative reduction. The gains are most pronounced on adversarial inputs, where the explicit reasoning step lets the model see through obfuscation techniques that defeat standard pattern-matching approaches. Our results demonstrate that teaching safety classifiers to reason about adversarial intent, rather than simply classify surface patterns, is a promising direction for robust LLM safety.
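The abstract mixes two kinds of change: absolute differences in percentage points (pp) for F1 and recall, and a relative reduction for the attack success rate. The short sketch below recomputes both from the figures reported above, which may help when comparing these numbers against other work. The helper names `pp_change` and `relative_reduction` are illustrative and do not come from the paper.

```python
def pp_change(before: float, after: float) -> float:
    """Absolute change in percentage points for a metric on a 0-1 scale."""
    return (after - before) * 100


def relative_reduction(before: float, after: float) -> float:
    """Reduction expressed as a percentage of the starting value."""
    return (before - after) / before * 100


# Figures as reported in the abstract; nothing here is newly measured.
f1_gain = pp_change(0.770, 0.842)          # WildGuardTest F1
adv_recall_gain = pp_change(0.513, 0.921)  # WildGuardTest adversarial recall
asr_drop_pp = 10.3 - 1.8                   # JailbreakBench ASR, already in percent
asr_relative = relative_reduction(10.3, 1.8)

print(f"F1 gain:                 {f1_gain:+.1f} pp")
print(f"Adversarial recall gain: {adv_recall_gain:+.1f} pp")
print(f"ASR drop:                {asr_drop_pp:.1f} pp ({asr_relative:.1f}% relative)")
```

The distinction matters for the JailbreakBench result: the attack success rate falls by 8.5 percentage points in absolute terms, which corresponds to the 82.5% relative reduction quoted in the abstract.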