Apr 20, 2026arXiv:2604.18901

Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams

AI Summary

This paper demonstrates that harmful intent can be geometrically recovered from the residual streams of various LLMs, including Qwen, Llama, and Gemma families, across different alignment variants. They find that harmful intent is represented as a linear direction in most layers, achieving high AUROC scores (up to 0.98) using direction-finding strategies like soft-AUC optimization and class-mean probing, and as angular deviation in others. Notably, detection remains stable even in models with surgically removed refusal behavior, suggesting a dissociation between harmful intent and refusal.

Key Contribution

Even after surgically removing refusal behavior from LLMs, a stable, linearly decodable representation of harmful intent persists in their residual streams.

Abstract

Harmful intent is geometrically recoverable from large language model residual streams: as a linear direction in most layers, and as angular deviation in layers where projection methods fail. Across 12 models spanning four architectural families (Qwen2.5, Qwen3.5, Llama-3.2, Gemma-3) and three alignment variants (base, instruction-tuned, abliterated), under single-turn, English evaluation, we characterise this geometry through six direction-finding strategies. Three succeed: a soft-AUC-optimised linear direction reaches mean AUROC 0.98 and TPR@1\%FPR 0.80; a class-mean probe reaches 0.98 and 0.71 at<1ms fitting cost; a supervised angular-deviation strategy reaches AUROC 0.96 and TPR of 0.61 along a representationally distinct direction ($73^\circ$ from projection-based solutions), uniquely sustaining detection in middle layers where projection methods collapse. Detection remains stable across alignment variants, including abliterated models from which refusal has been surgically removed: harmful intent and refusal behaviour are functionally dissociated features of the representation. A direction fitted on AdvBench transfers to held-out HarmBench and JailbreakBench with worst-case AUROC 0.96. The same picture holds at scale: across Qwen3.5 from 0.8B to 9B parameters, AUROC remains $\geq$0.98 and cross-variant transfer stays within 0.018 of own-direction performance This is consistent with a simple account: models acquire a linearly decodable representation of harmful intent as part of general language understanding, and alignment then shapes what they do with such inputs without reorganising the upstream recognition signal. As a practical consequence, AUROC in the 0.97+ regime can substantially overestimate operational detectability; TPR@$1\%$FPR should accompany AUROC in safety-adjacent evaluation.

Constitutional AI & AI Ethics Interpretability & Mechanistic Interp Red-Teaming & Adversarial Robustness

Citation Metrics

Citations0

Influential citations0

References25

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Harmful Intent as a Geometrically Recoverable Feature of LLM Residual Streams

Related Papers