This paper introduces Safety Reflection Pretraining, a pretraining-stage alignment method that regularly inserts short safety reflections into the pretraining corpora of large language models (LLMs). The authors report that the method improves safety classification accuracy and substantially reduces the success rates of inference-stage and finetuning attacks. Experiments cover 1.7B models pretrained on FineWeb-Edu and a fully controlled synthetic environment, MedSafetyWorld. In MedSafetyWorld ablations, the method is better than data filtering and rewriting at preventing models from acting on unsafe behaviors generalized from safe data.
Safety Reflection Pretraining reduces the success rates of inference-stage and finetuning attacks by building self-monitoring into LLMs during pretraining.
To achieve deeper safety alignment for large language models (LLMs), recent efforts have studied how to push safety interventions earlier into the pretraining stage, primarily by filtering unsafe data or rewriting it into safer forms. We argue that pretraining-stage alignment should go beyond making the data safe: LLMs may compose seemingly benign knowledge and capabilities into unsafe behaviors. To this end, we propose Safety Reflection Pretraining, a pretraining-stage alignment method which regularly inserts short safety reflections into pretraining corpora to integrate self-monitoring directly into language modeling, establishing a foundational capability that is subsequently reinforced by compatible post-training. Our experiments with 1.7B models pretrained on FineWeb-Edu show that Safety Reflection Pretraining improves safety classification accuracy and substantially reduces the success rates of inference-stage and finetuning attacks. Complementary to our real-world experiments, we also introduce a fully controlled synthetic environment, MedSafetyWorld, with a clear definition of safety and a reasoning structure under which models can easily generalize unsafe behaviors from safe data. Ablations in MedSafetyWorld further demonstrate a clear advantage of Safety Reflection Pretraining in preventing models from acting on unsafe behaviors generalized from safe data, compared with data filtering and rewriting. Taken together, our findings suggest that pretraining alignment should not only make the training data safe, but also shape the behaviors that models are likely to acquire from safe data.
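The insertion step described in the abstract can be pictured with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function name `insert_reflections`, the fixed token interval `every`, the `<reflect>` markers, and the keyword-based `toy_reflection`. The abstract does not say how reflections are written or where exactly they are placed.

```python
from typing import Callable, List


def insert_reflections(
    tokens: List[str],
    make_reflection: Callable[[List[str]], str],
    every: int,
) -> List[str]:
    """Insert one reflection after each full chunk of `every` tokens.

    Hypothetical helper: the real method's placement rule is not given
    in the abstract, so a fixed interval is assumed here.
    """
    if every <= 0:
        raise ValueError("every must be positive")
    out: List[str] = []
    for start in range(0, len(tokens), every):
        chunk = tokens[start:start + every]
        out.extend(chunk)
        # A trailing partial chunk gets no reflection.
        if len(chunk) == every:
            out.append(make_reflection(chunk))
    return out


def toy_reflection(chunk: List[str]) -> str:
    """Toy stand-in: flags a chunk by keyword (assumption, not the paper's rule)."""
    label = "caution" if "overdose" in chunk else "ok"
    return f"<reflect>{label}</reflect>"


doc = "take one tablet daily never overdose on purpose today".split()
augmented = insert_reflections(doc, toy_reflection, every=4)
print(" ".join(augmented))
```

In this sketch the reflection is computed from the chunk that precedes it, which is one simple way to make the inserted text act as a running check on what was just read. A real pipeline would presumably operate on tokenizer output and use richer reflections.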