This paper studies "emergent alignment" in large language models (LLMs) by finetuning a helpful-only model on both broad and narrow safety tasks, and uses the results to support the persona selection hypothesis. The authors follow the Constitutional AI approach, creating SFT samples from four constitutions (deontology, consequentialism, virtue ethics, and subordination to human authority), and show that finetuning on narrow safety subcategories can induce emergent alignment across general safety categories. The models acquire the ethical personas expected from their training constitutions, but the finetuned models differ notably in how well these personas project.
Finetuning LLMs on narrow safety tasks can induce emergent alignment, revealing significant differences in how well ethical personas project across various alignment strategies.
Work on "emergent misalignment" shows that finetuning LLMs on narrow tasks can induce broadly misaligned behavior. This supports the "persona selection" (PSM) hypothesis: during pre-training, LLMs learn to simulate different characters and perspectives, which can be elicited and refined during post-training. This paper investigates the converse phenomenon, "emergent alignment", and uses it to support and refine the PSM and to motivate a novel desideratum for alignment. We finetune a helpful-only model on broad and narrow safety tasks. To create SFT samples, we follow the "Constitutional AI" (CAI) approach and use four constitutions that encode reasonable alignment strategies: deontology, consequentialism, virtue ethics, and aligning AIs as subordinate to human authority. For each of those models, we show that finetuning on two narrow safety subcategories reliably induces emergent alignment over a representative set of general safety categories, and on safety subcategories that we directly filtered out of the datasets used for narrow alignment. To test the PSM with a more fine-grained evaluation, we use a multidimensional "ethical persona" diagnostic. For each constitutionally finetuned (broad/narrow) model, we evaluate how well its behavior matches its expected signature profile. Our results show that our CAI models acquire their expected "ethical persona": for example, the model narrowly finetuned on SFT samples created using the consequentialist constitution agrees significantly more with utilitarian than with deontological beliefs. Yet our coarse and fine-grained evaluations show that there are significant differences across our (broad/narrow) finetuned CAI models in how well they project. We conclude that alignment strategies should be evaluated not just on their (in-distribution) general safety performance, but also specifically on their degree of projectability.