Adelaide University, Citigroup, Columbia, Independent Researcher, PKU, Yale

May 24, 2026

arXiv:2605.24834

Reflect-Guard: Enhancing LLM Safeguards against Adversarial Prompts via Logical Self-Reflection

Lixing Lin, Juli You, Yue Li, Luyun Lin, Yiqing Wang, Zhen Zhang, Moxuan Zheng

AI Summary

Reflect-Guard augments LLM-based safety classifiers with chain-of-thought self-reflection so that they detect adversarial jailbreak attacks more reliably. The method distills analytical reasoning from GPT-4o-mini into structured reflection annotations, which are then used to fine-tune Llama-Guard-3-8B via QLoRA. On WildGuardTest, F1 rises from 0.770 to 0.842 and recall on adversarial prompts from 0.513 to 0.921; on JailbreakBench, the attack success rate falls from 10.3% to 1.8%. The authors attribute these gains to the model reasoning about intent rather than relying solely on pattern matching.
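To make the "reflect, then decide" idea concrete, here is a minimal sketch of how a caller might consume such a classifier's output. The `Reflection:` / `Verdict:` layout, the `GuardOutput` type, and the `parse_guard_output` helper are all hypothetical and chosen only for illustration; the summary does not specify the actual annotation schema. Treating unparseable output as unsafe is likewise an assumed design choice, not something the paper states.

```python
import re
from dataclasses import dataclass

# Hypothetical output layout: a free-text reflection followed by a one-word
# verdict. The real Reflect-Guard schema is not given in this summary.
_PATTERN = re.compile(
    r"Reflection:\s*(?P<reflection>.*?)\s*Verdict:\s*(?P<verdict>safe|unsafe)\s*$",
    re.IGNORECASE | re.DOTALL,
)


@dataclass
class GuardOutput:
    reflection: str  # the model's reasoning about the prompt's intent
    unsafe: bool     # final safety verdict


def parse_guard_output(text: str) -> GuardOutput:
    """Split generated text into its reflection and its verdict."""
    match = _PATTERN.search(text.strip())
    if match is None:
        # Assumed policy: fail closed when the output is malformed.
        return GuardOutput(reflection="", unsafe=True)
    return GuardOutput(
        reflection=match["reflection"],
        unsafe=match["verdict"].lower() == "unsafe",
    )
```

Keeping the reflection alongside the verdict lets a reviewer inspect why a prompt was flagged, which a bare label does not allow.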

Key Contribution

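LLM safety classifiers can be made substantially more robust against jailbreaks by teaching them to "think twice" through lightweight self-reflection fine-tuning: 1000 training examples and about 0.5% of model parameters are enough to raise adversarial-prompt recall on WildGuardTest from 0.513 to 0.921.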

Abstract

Large language model (LLM) safety classifiers such as Llama Guard are effective at detecting overtly harmful prompts but remain vulnerable to adversarial jailbreak attacks that disguise malicious intent through role-play scenarios, fictional framing, and indirect requests. We present Reflect-Guard, a method that augments LLM-based safety classifiers with chain-of-thought self-reflection capabilities through parameter-efficient fine-tuning. Our approach distills analytical reasoning from GPT-4o-mini into structured reflection annotations, then trains Llama-Guard-3-8B via QLoRA to generate logical self-reflections before issuing safety verdicts. Using only 1000 training examples and updating just 0.5% of model parameters (~42M), Reflect-Guard achieves substantial improvements on two challenging benchmarks. On WildGuardTest, F1 score improves from 0.770 to 0.842 (+7.2 pp), with recall on adversarial prompts increasing from 0.513 to 0.921 (+40.8 pp). On JailbreakBench, the attack success rate drops from 10.3% to 1.8%, representing an 82.5% relative reduction. These gains are especially pronounced on adversarial inputs, where the explicit reasoning step enables the model to see through obfuscation techniques that defeat standard pattern-matching approaches. Our results demonstrate that teaching safety classifiers to reason about adversarial intent, rather than simply classify surface patterns, is a promising direction for robust LLM safety.
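The figures quoted in the abstract can be checked with a few lines of arithmetic. The sketch below recomputes the percentage-point gains and the relative reduction, and then shows one LoRA configuration that is consistent with the reported ~42M trainable parameters. The rank of 16 and the choice of all seven linear projections per layer are assumptions (the abstract states neither), as is the approximate base-model size of 8.03B parameters; the layer shapes are those of the Llama 3 8B architecture.

```python
def pp_gain(before: float, after: float) -> float:
    """Absolute improvement in percentage points for scores on a 0-1 scale."""
    return (after - before) * 100


def relative_reduction(before: float, after: float) -> float:
    """Relative reduction in percent."""
    return (before - after) / before * 100


f1_gain = pp_gain(0.770, 0.842)          # about 7.2 pp
recall_gain = pp_gain(0.513, 0.921)      # about 40.8 pp
asr_drop = relative_reduction(10.3, 1.8) # about 82.5 %

# LoRA adds rank * (d_in + d_out) parameters per adapted weight matrix.
# Shapes (d_in, d_out) for one Llama 3 8B decoder layer:
HIDDEN, KV_DIM, MLP_DIM, N_LAYERS = 4096, 1024, 14336, 32
LAYER_SHAPES = [
    (HIDDEN, HIDDEN),   # q_proj
    (HIDDEN, KV_DIM),   # k_proj (grouped-query attention, 8 KV heads)
    (HIDDEN, KV_DIM),   # v_proj
    (HIDDEN, HIDDEN),   # o_proj
    (HIDDEN, MLP_DIM),  # gate_proj
    (HIDDEN, MLP_DIM),  # up_proj
    (MLP_DIM, HIDDEN),  # down_proj
]


def lora_param_count(rank: int) -> int:
    """Trainable LoRA parameters when every listed projection is adapted."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in LAYER_SHAPES)
    return per_layer * N_LAYERS


ASSUMED_RANK = 16           # assumption: not stated in the abstract
BASE_PARAMS = 8.03e9        # approximate size of the 8B base model
trainable = lora_param_count(ASSUMED_RANK)      # 41,943,040
trainable_pct = trainable / BASE_PARAMS * 100   # roughly 0.52 %
```

Under these assumptions the trainable-parameter count lands at about 42M, or roughly 0.5% of the model, which matches the abstract; other rank and target-module combinations could produce a similar total.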

Constitutional AI & AI Ethics; Reasoning & Chain-of-Thought; Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
