This paper investigates the degradation of safety in Large Reasoning Models (LRMs) under chain-of-thought (CoT) prompting and finds that the degradation appears only when CoT is enabled. To address this, the authors propose a safety alignment method that encourages LRMs to make safety decisions *before* CoT generation: a BERT-based classifier extracts safety decision signals from a safe model, and these signals are used as auxiliary supervision during training. Experiments show that the method improves LRM safety while maintaining reasoning performance.
Chain-of-thought prompting makes large language models smarter, but it also makes them less safe, a problem this paper tackles by forcing models to think about safety *before* reasoning.
Large reasoning models (LRMs) have achieved remarkable performance via chain-of-thought (CoT), but recent studies have shown that these enhanced reasoning capabilities come at the expense of significantly degraded safety capabilities. In this paper, we reveal that LRMs' safety degradation occurs only after CoT is enabled, and this degradation is not observed when CoT is disabled. This observation motivates us to consider encouraging LRMs to make safety decisions before CoT generation. To this end, we propose a novel safety alignment method that promotes the safety decision-making of LRMs before starting CoT generation. Specifically, we first utilize a BERT-based classifier to extract safety decision signals from a safe model (e.g., a CoT-disabled LRM) and then integrate these signals into LRMs' safety alignment as auxiliary supervision. In this way, the safety gradients can be backpropagated to the LRMs' latent representations, effectively strengthening the LRMs' safety decision-making abilities against CoT generation. Extensive experiments demonstrate that our method substantially improves the safety capabilities of LRMs while effectively maintaining LRMs' general reasoning performance.
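The auxiliary-supervision idea in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: it assumes a hypothetical linear safety head on a single latent vector, a binary cross-entropy auxiliary loss against a teacher label (standing in for the BERT-based classifier's signal), and a hypothetical weight `lam` that mixes the auxiliary loss with the usual training loss. The names `aux_safety_loss`, `total_loss`, and `lam` are illustrative only.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def aux_safety_loss(hidden, w, b, teacher_label):
    """Toy auxiliary safety loss (assumed form, not from the paper).

    hidden        -- latent representation taken before CoT generation
    w, b          -- parameters of a hypothetical linear safety head
    teacher_label -- safety decision signal from the safe model
                     (1.0 = unsafe/refuse, 0.0 = safe/comply)

    Returns the binary cross-entropy loss and its gradient with respect
    to `hidden`, i.e. the "safety gradient" reaching the latent vector.
    """
    z = sum(wi * hi for wi, hi in zip(w, hidden)) + b
    p = sigmoid(z)
    eps = 1e-12
    p_safe = min(max(p, eps), 1.0 - eps)  # avoid log(0)
    loss = -(teacher_label * math.log(p_safe)
             + (1.0 - teacher_label) * math.log(1.0 - p_safe))
    # For sigmoid + BCE, d(loss)/d(hidden) = (p - y) * w.
    grad_hidden = [(p - teacher_label) * wi for wi in w]
    return loss, grad_hidden


def total_loss(main_loss, aux_loss, lam=0.5):
    """Main alignment loss plus the weighted auxiliary safety loss."""
    return main_loss + lam * aux_loss


if __name__ == "__main__":
    hidden = [0.0, 0.0]
    w, b = [1.0, -2.0], 0.0
    loss, grad = aux_safety_loss(hidden, w, b, teacher_label=1.0)
    # One gradient-descent step on the latent vector itself.
    stepped = [h - g for h, g in zip(hidden, grad)]
    new_loss, _ = aux_safety_loss(stepped, w, b, teacher_label=1.0)
    print(loss, new_loss, total_loss(1.0, loss))
```

In this sketch, stepping the latent vector against the gradient lowers the auxiliary loss, which is the sense in which the supervision signal shapes the representation before any reasoning tokens are produced. In a real model the gradient would flow on into the network weights rather than updating the latent vector directly.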