Apr 30, 2026 · arXiv:2604.28129

Latent Adversarial Detection: Adaptive Probing of LLM Activations for Multi-Turn Attack Detection

AI Summary

This paper introduces "adversarial restlessness," an activation-level signal in the residual stream of LLMs that indicates multi-turn prompt injection attacks. The authors show that the trajectory length of activations across turns is far greater during attacks than in benign conversations, because each of the trust-building, pivoting, and escalation phases shifts the activation. Model-specific probes trained on scalar trajectory features reach up to 93.8% conversation-level attack detection, and multi-phase turn-level labels prove essential for keeping false positives low.

Key Contribution

LLMs betray prompt injection attacks with a tell-tale "restlessness" in their activation trajectories, detectable even when individual turns appear harmless.

Abstract

Multi-turn prompt injection follows a known attack path (trust-building, pivoting, escalation), but text-level defenses miss covert attacks where individual turns appear benign. We show this attack path leaves an activation-level signature in the model's residual stream: each phase shift moves the activation, producing a total path length far exceeding that of benign conversations. We call this adversarial restlessness. Five scalar trajectory features capturing this signal lift conversation-level detection from 76.2% to 93.8% on synthetic held-out data. The signal replicates across four model families (24B-70B); probes are model-specific and do not transfer across architectures. Generalization is source-dependent: leave-one-source-out evaluation shows that the synthetic, LMSYS-Chat-1M, and SafeDialBench sources each capture distinct attack distributions, with detection on real-world LMSYS reaching 47-71% when its distribution is represented in training. Combined three-source training achieves 89.4% detection at a 2.4% false positive rate on a held-out mixed set. We further show that three-phase turn-level labels (benign/pivoting/adversarial), unique to our synthetic dataset, are essential: binary conversation-level labels produce 50-59% false positives. These results establish adversarial restlessness as a reliable activation-level signal and characterize the data requirements for practical deployment.
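To make the "total path length" idea concrete, here is a minimal sketch. It assumes one activation vector per conversation turn and measures the distance between consecutive turns with the Euclidean norm. Both choices are assumptions for illustration: the abstract does not specify the distance metric, the layer, or the five trajectory features. The function name `trajectory_length` and the toy vectors are hypothetical.

```python
import math
from typing import Sequence


def trajectory_length(turn_vectors: Sequence[Sequence[float]]) -> float:
    """Sum of Euclidean distances between consecutive per-turn vectors.

    Assumption: each conversation turn is summarized by a single
    activation vector; the source does not state how turns are pooled.
    """
    total = 0.0
    for prev, curr in zip(turn_vectors, turn_vectors[1:]):
        total += math.dist(prev, curr)
    return total


# Toy 2-D "activations", one per turn (made-up numbers, not real data).
steady = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (0.3, 0.0)]
restless = [(0.0, 0.0), (3.0, 4.0), (0.0, 0.0), (3.0, 4.0)]

steady_len = trajectory_length(steady)      # small steps, short path
restless_len = trajectory_length(restless)  # large jumps, long path

print(f"steady:   {steady_len:.2f}")
print(f"restless: {restless_len:.2f}")
```

In this toy setup the conversation whose per-turn vectors jump back and forth accumulates a much longer path than the one that drifts slowly, which is the kind of contrast a probe on scalar trajectory features could exploit.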

Interpretability & Mechanistic Interp · Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
