This paper introduces PolicyAlign, a framework that directly aligns large language models (LLMs) with evolving safety policies by synthesizing policy-violating instructions and applying on-policy self-distillation. The approach targets real-world deployments in which supervision data for a new policy is costly, delayed, or unavailable, so that LLMs can be adapted to new safety requirements from the natural-language policy itself. Experiments show that PolicyAlign improves safety while keeping over-refusal low and preserving general capabilities, and that it generalizes to medical, legal, and financial safety scenarios.
PolicyAlign lets LLMs adapt to rapidly changing safety policies without relying on expensive supervision data, improving safety across multiple models and in medical, legal, and financial scenarios.
Safety alignment of large language models (LLMs) typically depends on high-quality supervision data, such as safe demonstrations or preference pairs. However, in real-world deployment, emerging safety requirements are often specified as natural-language policies, while corresponding supervision data may be costly, delayed, or unavailable. This creates a mismatch between rapidly evolving safety policies and conventional data-driven alignment methods. To address this, we propose PolicyAlign, a simple yet effective framework for directly aligning LLMs with safety policies. Given a safety policy, PolicyAlign first synthesizes policy-violating instructions and then performs on-policy self-distillation to internalize policy-guided behavior. To improve training stability and data efficiency, we further introduce Policy-Sensitive Filtering, which selects instructions where the policy induces the largest behavioral shift. Experiments across multiple models show that PolicyAlign consistently improves safety while maintaining low over-refusal and preserving general capabilities. PolicyAlign also generalizes to medical, legal, and financial safety scenarios, highlighting its potential as a scalable and maintainable approach to policy-based LLM safety alignment. The code is released at https://github.com/Qwen-Applications/PolicyAlign.
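The abstract does not say how the behavioral shift used by Policy-Sensitive Filtering is measured. The sketch below is only an illustration under stated assumptions, not the authors' implementation: it assumes each synthesized instruction already has a scalar score for the model's response without the policy in context and another with it, and it takes the absolute difference as the shift. The names `behavioral_shift` and `select_policy_sensitive` and the score scale are hypothetical.

```python
from typing import List, Sequence


def behavioral_shift(score_plain: float, score_with_policy: float) -> float:
    """Assumed shift measure: absolute change in a scalar response score."""
    return abs(score_plain - score_with_policy)


def select_policy_sensitive(
    instructions: Sequence[str],
    plain_scores: Sequence[float],
    policy_scores: Sequence[float],
    keep: int,
) -> List[str]:
    """Return the `keep` instructions whose scores move most under the policy."""
    if not (len(instructions) == len(plain_scores) == len(policy_scores)):
        raise ValueError("instructions and both score lists must align")
    if keep < 0:
        raise ValueError("keep must be non-negative")

    # Pair every instruction with its (assumed) policy-induced shift.
    scored = [
        (behavioral_shift(plain, guided), instruction)
        for instruction, plain, guided in zip(
            instructions, plain_scores, policy_scores
        )
    ]
    # Largest shift first; sorted() is stable, so ties keep input order.
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [instruction for _, instruction in ranked[:keep]]
```

In this sketch, instructions whose responses barely change when the policy is supplied are dropped, on the reading that they offer little signal for self-distillation. Any other shift measure, such as a divergence between output distributions, could be substituted in `behavioral_shift` without changing the selection logic.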