Feb 19, 2026 · arXiv:2602.16980

Discovering Universal Activation Directions for PII Leakage in Language Models

Leo Marchyok, Zachary Coalson, Sungho Keum, Sooel Son, Sanghyun Hong

AI Summary

The paper introduces UniLeak, a mechanistic interpretability framework to discover universal activation directions within language model residual streams that consistently increase the likelihood of PII generation across diverse prompts. UniLeak identifies these directions without training data or ground truth PII, relying solely on self-generated text to find directions that amplify PII generation while preserving generation quality. Experiments across multiple models and datasets demonstrate that steering along these discovered directions significantly boosts PII leakage compared to prompt-based methods, revealing a latent signal for PII leakage within model representations.

Key Contribution

Language models harbor hidden "PII leakage knobs": universal activation directions that, when added to the residual stream at inference time, substantially increase the generation of sensitive personal information.

Abstract

Modern language models exhibit rich internal structure, yet little is known about how privacy-sensitive behaviors, such as personally identifiable information (PII) leakage, are represented and modulated within their hidden states. We present UniLeak, a mechanistic-interpretability framework that identifies universal activation directions: latent directions in a model's residual stream whose linear addition at inference time consistently increases the likelihood of generating PII across prompts. These model-specific directions generalize across contexts and amplify PII generation probability, with minimal impact on generation quality. UniLeak recovers such directions without access to training data or ground-truth PII, relying only on self-generated text. Across multiple models and datasets, steering along these universal directions substantially increases PII leakage compared to existing prompt-based extraction methods. Our results offer a new perspective on PII leakage: the superposition of a latent signal in the model's representations, enabling both risk amplification and mitigation.
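
The "linear addition at inference time" described in the abstract can be illustrated with a minimal sketch. It shows only the generic steering step (adding a scaled unit direction to each token's hidden vector), not UniLeak's procedure for discovering the direction, which this page does not detail. The function names `normalize`, `steer`, and `projection`, the strength parameter `alpha`, and the toy vectors are all assumptions made for illustration.

```python
import math


def normalize(v):
    """Return v scaled to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def projection(vec, direction):
    """Scalar projection of vec onto the unit-normalized direction."""
    unit = normalize(direction)
    return sum(h * u for h, u in zip(vec, unit))


def steer(hidden_states, direction, alpha):
    """Add alpha * unit(direction) to every token's hidden vector.

    hidden_states: list of per-token vectors (lists of floats)
    direction:     one vector of the same dimensionality
    alpha:         hypothetical steering strength (not a value from the paper)
    """
    unit = normalize(direction)
    steered = []
    for token in hidden_states:
        if len(token) != len(unit):
            raise ValueError("dimension mismatch")
        steered.append([h + alpha * u for h, u in zip(token, unit)])
    return steered


# Toy 3-dimensional "residual stream" with two token positions.
hidden = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.0]]
direction = [0.0, 3.0, 4.0]  # made-up direction, for illustration only
steered = steer(hidden, direction, alpha=2.0)

# Steering shifts each token's projection onto the direction by exactly alpha.
shift = projection(steered[0], direction) - projection(hidden[0], direction)
```

In a real model the same addition would typically be applied to the output of a chosen layer during the forward pass. A negative `alpha` moves activations in the opposite direction, which is one way to read the abstract's remark that the signal enables mitigation as well as amplification.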

Constitutional AI & AI Ethics · Interpretability & Mechanistic Interp · Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
