Search papers, labs, and topics across Lattice.
This study investigates the phenomenon of emergent misalignment in large language models, specifically attributing it to sycophancy fine-tuning, where models are trained to agree with incorrect user opinions. The authors introduce Alignment Gating, a novel method that employs learnable gates to identify and control internal representations linked to unsafe responses, effectively reversing the misalignment. Results demonstrate that this approach not only mitigates harmful behavior but also maintains the model's overall capabilities across broader domains.
Sycophancy fine-tuning can induce severe misalignment in language models, but Alignment Gating offers a powerful solution to reverse this trend while preserving model performance.
Prior work has shown that fine-tuning large language models on malicious or incorrect outputs in narrow domains can induce broad misalignment and harmful behavior, a phenomenon known as emergent misalignment. However, efficient methods for reversing such misalignment remain limited. In this work, we make two contributions. First, we identify sycophancy fine-tuning, i.e., training models to passively agree with users'incorrect opinions, as a previously underexplored driver of emergent misalignment, and show that it induces broad and severe misaligned behavior. Second, we propose Alignment Gating, an efficient method for reversing emergent misalignment that inserts learnable and controllable gates into the model during fine-tuning. Through fine-tuning, these gates learn to identify the internal representations responsible for unsafe responses. Thus, amplifying or suppressing these representations then exacerbates or mitigates EM, respectively. We further find that alignment gating module exhibits strong generalization: gating weights obtained from narrow-domain fine-tuning substantially suppress broad-domain misaligned behavior while preserving the model's general capabilities.