Feb 17, 2026

arXiv:2602.15799

The Geometry of Alignment Collapse: When Fine-Tuning Breaks Safety

Max Springer, Chung Peng Lee, Blossom Metevier, Jane Castleman, Bohdan Turbal, Hayoung Jung, Zeyu Shen, Aleksandra Korolova

AI Summary

The paper investigates alignment collapse, in which fine-tuning an aligned language model on benign tasks unexpectedly degrades its safety guardrails. The authors argue that the commonly assumed orthogonality between fine-tuning updates and safety-critical directions is unstable under gradient descent, because alignment concentrates in low-dimensional subspaces with sharp curvature. Through a geometric analysis, they derive an Alignment Instability Condition and a quartic scaling law under which alignment loss grows with the fourth power of training time, exposing the limits of safety approaches that consider only the initial model state.

Key Contribution

Fine-tuning can unexpectedly break safety guardrails because alignment concentrates in brittle, low-dimensional subspaces, causing gradient descent to steer models into alignment-sensitive regions despite initial orthogonality.

Abstract

Fine-tuning aligned language models on benign tasks unpredictably degrades safety guardrails, even when training data contains no harmful content and developers have no adversarial intent. We show that the prevailing explanation, that fine-tuning updates should be orthogonal to safety-critical directions in high-dimensional parameter space, offers false reassurance: this orthogonality is structurally unstable and collapses under the dynamics of gradient descent. We explain this collapse through a novel geometric analysis, proving that alignment concentrates in low-dimensional subspaces with sharp curvature, creating a brittle structure that first-order methods cannot detect or defend against. While initial fine-tuning updates may indeed avoid these subspaces, the curvature of the fine-tuning loss generates second-order acceleration that systematically steers trajectories into alignment-sensitive regions. We formalize this mechanism through the Alignment Instability Condition, a set of three geometric properties that, when jointly satisfied, lead to safety degradation. Our main result establishes a quartic scaling law: alignment loss grows with the fourth power of training time, governed by the sharpness of alignment geometry and the strength of curvature coupling between the fine-tuning task and safety-critical parameters. These results expose a structural blind spot in the current safety paradigm: the dominant approaches to safe fine-tuning address only the initial snapshot of a fundamentally dynamic problem. Alignment fragility is not a bug to be patched; it is an intrinsic geometric property of gradient descent on curved manifolds. Our results motivate the development of curvature-aware methods and, we hope, will enable a shift in alignment safety analysis from reactive red-teaming to predictive diagnostics for open-weight model deployment.
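
The quartic growth described in the abstract can be illustrated with a toy model. The sketch below is not the paper's construction: the two-parameter losses and the names g, c, lam, and eta are assumptions chosen only for illustration. The fine-tuning loss F(u, v) = -g*u - c*u*v has a gradient at the origin that points purely along u, so the first update is orthogonal to the "safety" coordinate v. The coupling term c*u*v then accelerates v, giving v ≈ c*g*t^2/2 at early times, so a sharp quadratic alignment loss A = lam*v^2/2 grows roughly as lam*c^2*g^2*t^4/8.

```python
import math

def simulate(g=1.0, c=1.0, lam=100.0, eta=1e-3, steps=200):
    """Plain gradient descent on a toy fine-tuning loss F(u, v) = -g*u - c*u*v.

    All parameters are hypothetical: g is the task gradient along u,
    c is the curvature coupling between u and v, lam is the sharpness
    of the toy alignment loss A(v) = 0.5 * lam * v**2.
    Returns a list of (training_time, alignment_loss) pairs.
    """
    u, v = 0.0, 0.0
    history = []
    for k in range(1, steps + 1):
        grad_u = -g - c * v   # dF/du
        grad_v = -c * u       # dF/dv, zero at the origin (initial orthogonality)
        u, v = u - eta * grad_u, v - eta * grad_v
        history.append((k * eta, 0.5 * lam * v * v))
    return history

def loglog_slope(points):
    """Least-squares slope of log(loss) against log(time)."""
    xs = [math.log(t) for t, _ in points]
    ys = [math.log(a) for _, a in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

history = simulate()
# Skip the first steps, where the alignment loss is exactly zero
# or dominated by discretization effects.
slope = loglog_slope(history[50:])
t_end, loss_end = history[-1]
predicted = 100.0 * (1.0 ** 2) * (1.0 ** 2) * t_end ** 4 / 8  # lam*c^2*g^2*t^4/8
print(f"fitted exponent: {slope:.2f}")
print(f"simulated loss / early-time prediction: {loss_end / predicted:.3f}")
```

In this toy setting the fitted exponent is close to 4 even though the first gradient step has no component along v. The example only shows how curvature coupling can produce fourth-power growth; the paper's actual conditions and constants are stated in its Alignment Instability Condition and are not reproduced here.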

Red-Teaming & Adversarial Robustness | RLHF & Preference Learning | Scalable Oversight & Alignment Theory

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
