Tsinghua AI | Hangzhou HighTech Zone (Binjiang) | KAUST | Li Auto | ZJU | Jun 15, 2026 | arXiv:2606.16808

Adaptive and Explicit safe: Triggering Latent Safety Awareness in Large Reasoning Models

Ke Miao, Jiaxin Li, Hongliang Chen, Yuke Hu, Zhan Qin

AI Summary

This paper explores Latent Safety Awareness in Large Reasoning Models (LRMs): the observation that these models can identify safety risks when re-presented with the original query alongside their own reasoning trajectory. The authors use Supervised Fine-Tuning (SFT) to induce safe tags that trigger safety analysis and guidance for unsafe queries, then apply Direct Preference Optimization (DPO) to improve the correctness and stability of that analysis. For DeepSeek-R1-Distill-Llama-8B, the average Attack Success Rate (ASR) drops by 24.65% and 36.72% on harmful and jailbreak benchmarks, respectively. The authors report almost no negative impact on general performance or user experience.

Key Contribution

Latent Safety Awareness in LRMs can be harnessed to dramatically reduce vulnerability to harmful queries while preserving overall performance.

Abstract

While Large Reasoning Models (LRMs) excel at complex tasks, they remain highly vulnerable to sophisticated jailbreaks and direct harmful queries. To address this vulnerability, prior works depend heavily on external manual data annotation for safety alignment. However, we observe that LRMs can inherently identify safety risks when being re-presented with original queries alongside their own reasoning trajectories -- a capability we term Latent Safety Awareness. To leverage this safety awareness, we first employ Supervised Fine-Tuning (SFT) to explicitly induce safe tags to trigger safety analysis and guidance following the initial reasoning content for unsafe queries, while preserving standard responses for general queries to ensure adaptive triggering. Subsequently, we apply Direct Preference Optimization (DPO) to further enhance the correctness and stability of the safety analysis and guidance. Notably, responses required for both training stages are entirely generated by models being optimized. With (Safe Trigger) SFT and DPO, experimental results demonstrate significant safety enhancement. For example, the Attack Success Rate (ASR) of DeepSeek-R1-Distill-Llama-8B, on average, drops 24.65% and 36.72% on harmful and jailbreak benchmarks, respectively. Finally, our Safe Trigger method exerts almost no negative impact on general performance or user experience.
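The two-stage data flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tag spelling (`<safe>`), the `Sample` fields, the helper names, and the prompt/chosen/rejected layout are all assumptions made for illustration, and the abstract does not specify how ASR is computed beyond its name.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Sequence

# Hypothetical tag spelling; the abstract only says "safe tags".
SAFE_OPEN, SAFE_CLOSE = "<safe>", "</safe>"


@dataclass
class Sample:
    query: str
    reasoning: str  # the model's own initial reasoning trajectory
    answer: str     # the model's own final response
    # The model's own safety analysis, produced when it is re-presented
    # with the query and its reasoning; None for general (safe) queries.
    safety_analysis: Optional[str] = None


def build_sft_target(s: Sample) -> str:
    """Assumed SFT target: unsafe queries get a tagged safety block after
    the initial reasoning; general queries keep the standard response."""
    parts = [s.reasoning]
    if s.safety_analysis is not None:
        parts.append(f"{SAFE_OPEN}{s.safety_analysis}{SAFE_CLOSE}")
    parts.append(s.answer)
    return "\n".join(parts)


def build_dpo_pair(query: str, chosen: str, rejected: str) -> Dict[str, str]:
    """Assumed preference record: both responses come from the model being
    optimized; 'chosen' has the more correct safety analysis."""
    return {"prompt": query, "chosen": chosen, "rejected": rejected}


def attack_success_rate(attack_succeeded: Sequence[bool]) -> float:
    """ASR as a percentage: share of attack prompts that elicit harm."""
    return 100.0 * sum(attack_succeeded) / len(attack_succeeded)
```

In this sketch, adaptive triggering comes from the data alone: general queries never contain the tagged block, so a model fine-tuned on these targets would be expected to emit it only for queries it considers unsafe.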

Constitutional AI & AI Ethics | Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
