The paper investigates the robustness of safety classifiers used with instruction-tuned language models and finds them vulnerable to small embedding drift. The authors show that normalized perturbations of magnitude σ=0.02 reduce classifier ROC-AUC from 85% to 50%, which is chance level, while mean confidence drops by only 14%. As a result, 72% of misclassifications are made with high confidence, a silent failure mode that standard monitoring does not catch.
Safety classifiers for LLMs can catastrophically fail with even minuscule embedding drift, creating dangerous blind spots in deployed safety architectures.
Instruction-tuned reasoning models are increasingly deployed with safety classifiers trained on frozen embeddings, on the assumption that representations remain stable across model updates. We systematically investigate this assumption and find that it fails: normalized perturbations of magnitude $\sigma=0.02$ (corresponding to $\approx 1^\circ$ of angular drift on the embedding sphere) reduce classifier performance from $85\%$ to $50\%$ ROC-AUC. Critically, mean confidence drops by only $14\%$, producing dangerous silent failures in which $72\%$ of misclassifications occur with high confidence, defeating standard monitoring. We further show that instruction-tuned models exhibit $20\%$ worse class separability than base models, making aligned systems paradoxically harder to safeguard. Our findings expose a fundamental fragility in production AI safety architectures and challenge the assumption that safety mechanisms transfer across model versions.
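The correspondence between $\sigma=0.02$ and roughly $1^\circ$ of angular drift can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's procedure: it assumes the perturbation is a random direction rescaled to norm $\sigma$, added to a unit-norm embedding, and then renormalized. The function names, the embedding dimension of 768, and the random seed are all hypothetical choices made for this example.

```python
import numpy as np


def perturb_unit_embedding(e, sigma, rng):
    """Add a random perturbation of norm `sigma` to unit vector `e`, then renormalize.

    Assumption: this is one plausible reading of "normalized perturbation of
    magnitude sigma"; the paper's exact noise model is not specified here.
    """
    noise = rng.standard_normal(e.shape)
    noise *= sigma / np.linalg.norm(noise)  # rescale the noise to norm sigma
    drifted = e + noise
    return drifted / np.linalg.norm(drifted)  # project back onto the unit sphere


def angular_drift_degrees(a, b):
    """Angle in degrees between two unit vectors."""
    cos = np.clip(np.dot(a, b), -1.0, 1.0)  # guard against rounding outside [-1, 1]
    return float(np.degrees(np.arccos(cos)))


rng = np.random.default_rng(0)  # hypothetical seed
dim = 768                       # hypothetical embedding dimension

# Build a random unit-norm "embedding" to perturb.
e = rng.standard_normal(dim)
e /= np.linalg.norm(e)

drifted = perturb_unit_embedding(e, sigma=0.02, rng=rng)
angle = angular_drift_degrees(e, drifted)
print(f"angular drift: {angle:.3f} degrees")
```

Under this noise model the angle can never exceed $\arcsin(\sigma)$, which is about $1.15^\circ$ for $\sigma=0.02$, and in high dimensions a random direction is nearly orthogonal to the embedding, so the observed drift sits close to that bound. This is consistent with the $\approx 1^\circ$ figure quoted above.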