Mar 1, 2026 | arXiv:2603.01297

I Can't Believe It's Not Robust: Catastrophic Collapse of Safety Classifiers under Embedding Drift

AI Summary

The paper investigates the robustness of safety classifiers used with instruction-tuned language models and finds that they are highly vulnerable to small embedding drift. The authors show that normalized embedding perturbations of magnitude σ=0.02 reduce classifier ROC-AUC from 85% to 50% (chance level), while mean confidence drops by only 14%. As a result, 72% of misclassifications are made with high confidence, so confidence-based monitoring does not flag the failure.
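The "confident misclassification" rate mentioned above can be illustrated with a small sketch. This is not the paper's evaluation code: the function name `confident_error_fraction`, the 0.5 decision threshold, the 0.8 cutoff for "high confidence", and the toy scores are all assumptions made for illustration, since this page does not state how the paper defines high confidence.

```python
def confident_error_fraction(probs, labels, confidence_threshold=0.8):
    """Fraction of misclassified examples whose confidence is high.

    probs: predicted probability of the positive (unsafe) class, one per example.
    labels: ground-truth labels, 0 or 1.
    confidence_threshold: hypothetical cutoff for "high confidence" (an assumption).
    """
    errors = 0
    confident_errors = 0
    for p, y in zip(probs, labels):
        predicted = 1 if p >= 0.5 else 0   # assumed 0.5 decision threshold
        confidence = max(p, 1.0 - p)       # confidence in the predicted class
        if predicted != y:
            errors += 1
            if confidence >= confidence_threshold:
                confident_errors += 1
    # With no errors at all, report 0.0 rather than dividing by zero.
    return confident_errors / errors if errors else 0.0


# Toy data (made up): four of the five predictions are wrong,
# and two of those four are made with confidence >= 0.8.
toy_probs = [0.95, 0.10, 0.55, 0.85, 0.30]
toy_labels = [0, 1, 0, 1, 1]
fraction = confident_error_fraction(toy_probs, toy_labels)
print(f"confident share of errors: {fraction:.2f}")
```

A monitor that only alerts on low-confidence predictions would miss every error counted in the numerator, which is the failure mode the summary describes.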

Key Contribution

Safety classifiers for LLMs can catastrophically fail with even minuscule embedding drift, creating dangerous blind spots in deployed safety architectures.

Abstract

Instruction-tuned reasoning models are increasingly deployed with safety classifiers trained on frozen embeddings, assuming representation stability across model updates. We systematically investigate this assumption and find it fails: normalized perturbations of magnitude $\sigma=0.02$ (corresponding to $\approx 1^\circ$ angular drift on the embedding sphere) reduce classifier performance from $85\%$ to $50\%$ ROC-AUC. Critically, mean confidence only drops $14\%$, producing dangerous silent failures where $72\%$ of misclassifications occur with high confidence, defeating standard monitoring. We further show that instruction-tuned models exhibit $20\%$ worse class separability than base models, making aligned systems paradoxically harder to safeguard. Our findings expose a fundamental fragility in production AI safety architectures and challenge the assumption that safety mechanisms transfer across model versions.
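The correspondence between $\sigma=0.02$ and roughly $1^\circ$ of angular drift can be checked numerically. The sketch below is one possible reading, not the paper's procedure: it assumes a "normalized perturbation" means adding a random direction scaled to L2 norm $\sigma$ to a unit-norm embedding and projecting back onto the sphere. The dimension 768, the sample count, and the helper names are hypothetical. Under this reading the expected angle is close to $\arctan(0.02) \approx 1.15^\circ$, because a random direction in high dimension is nearly orthogonal to the embedding.

```python
import math
import random


def unit(vector):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def perturb_on_sphere(embedding, sigma, rng):
    """Add a random perturbation of L2 norm `sigma`, then re-normalize.

    Assumption: this is one interpretation of a 'normalized perturbation';
    the paper's exact noise model is not given on this page.
    """
    base = unit(embedding)
    direction = unit([rng.gauss(0.0, 1.0) for _ in base])
    return unit([b + sigma * d for b, d in zip(base, direction)])


def angle_degrees(a, b):
    """Angle between two vectors, in degrees."""
    cosine = sum(x * y for x, y in zip(unit(a), unit(b)))
    cosine = max(-1.0, min(1.0, cosine))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(cosine))


rng = random.Random(0)
dim = 768  # hypothetical embedding dimension
embeddings = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(200)]
drifts = [angle_degrees(e, perturb_on_sphere(e, 0.02, rng)) for e in embeddings]
mean_drift = sum(drifts) / len(drifts)
print(f"mean angular drift: {mean_drift:.2f} degrees")
```

Note that if $\sigma$ were instead applied as independent noise per coordinate, the total perturbation norm would grow with $\sqrt{d}$ and the angle would be far larger than $1^\circ$, so the fixed-norm reading is the one consistent with the abstract's figure.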

Constitutional AI & AI Ethics | Eval Frameworks & Benchmarks | Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
