Feb 23, 2026 | arXiv:2602.20031

Latent Introspection: Models Can Detect Prior Concept Injections

Theia Pearson-Vogel, Martin Vanek, Raymond Douglas, Jan Kulveit

AI Summary

This paper demonstrates that a Qwen 32B model has a latent capacity for introspection: it can detect that a concept was injected into its earlier context and identify which concept, even while denying any injection in its sampled text. Using logit lens analysis on the residual stream, the authors find clear detection signals that are attenuated in the final layers. Prompting the model with accurate information about AI introspection mechanisms strengthens the effect substantially, raising sensitivity to injection from 0.3% to 39.2% with only a 0.6% increase in false positives.
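As a rough illustration of what a logit lens does, the sketch below decodes intermediate residual-stream vectors through an unembedding matrix and compares the resulting token probabilities across layers. Everything in it is a toy assumption rather than the paper's setup: the two-token vocabulary, the hand-picked hidden states, the identity unembedding, and the gain-free RMS normalization standing in for a model's final norm are all hypothetical.

```python
import math

def rms_norm(x, eps=1e-6):
    # Stand-in for a model's final normalization layer (no learned gain; an assumption).
    scale = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / scale for v in x]

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def logit_lens(hidden, unembed):
    # Decode one residual-stream vector as if it were the final hidden state:
    # normalize it, then take a dot product with each vocabulary row.
    normed = rms_norm(hidden)
    logits = [sum(h * w for h, w in zip(normed, row)) for row in unembed]
    return softmax(logits)

# Hypothetical two-token vocabulary and unembedding matrix (one row per token).
vocab = ["yes", "no"]
unembed = [[1.0, 0.0], [0.0, 1.0]]

# Hypothetical residual-stream states at three successive layers.
layer_states = [[0.1, 0.0], [1.0, 0.2], [0.3, 0.4]]

# Probability assigned to "yes" when each layer is decoded directly.
p_yes_by_layer = [logit_lens(h, unembed)[vocab.index("yes")] for h in layer_states]

for layer, p in enumerate(p_yes_by_layer):
    print(f"layer {layer}: P(yes) = {p:.3f}")
```

In this toy data the "yes" token is favored at the middle layer but not at the last one, which mirrors the qualitative pattern described above: a signal visible mid-stream that is weaker by the time the model samples its output.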

Key Contribution

LLMs may already possess surprisingly strong self-awareness of concept manipulation, detectable via mechanistic interpretability techniques, even when they deny it in their outputs.

Abstract

We uncover a latent capacity for introspection in a Qwen 32B model, demonstrating that the model can detect when concepts have been injected into its earlier context and identify which concept was injected. While the model denies injection in sampled outputs, logit lens analysis reveals clear detection signals in the residual stream, which are attenuated in the final layers. Furthermore, prompting the model with accurate information about AI introspection mechanisms can dramatically strengthen this effect: the sensitivity to injection increases massively (0.3% -> 39.2%) with only a 0.6% increase in false positives. Also, mutual information between nine injected and recovered concepts rises from 0.62 bits to 1.05 bits, ruling out generic noise explanations. Our results demonstrate models can have a surprising capacity for introspection and steering awareness that is easy to overlook, with consequences for latent reasoning and safety.
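To make the mutual information figures concrete, here is a minimal sketch of how mutual information in bits can be computed from a table of (injected concept, recovered concept) counts. The function name and the count tables are hypothetical and are not taken from the paper; the estimator shown is the plain plug-in estimate, which may differ from the authors' procedure.

```python
import math

def mutual_information_bits(counts):
    # counts[i][j] = number of trials where concept i was injected
    # and concept j was recovered. Returns the plug-in estimate of I(X; Y) in bits.
    total = sum(sum(row) for row in counts)
    row_sums = [sum(row) for row in counts]
    col_sums = [sum(col) for col in zip(*counts)]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, c in enumerate(row):
            if c == 0:
                continue  # zero cells contribute nothing to the sum
            p_xy = c / total
            # p_xy / (p_x * p_y) simplifies to c * total / (row_sum * col_sum)
            mi += p_xy * math.log2(c * total / (row_sums[i] * col_sums[j]))
    return mi

n = 9  # nine concepts, as in the abstract

# Hypothetical extremes: perfect recovery versus recovery unrelated to injection.
perfect = [[10 if i == j else 0 for j in range(n)] for i in range(n)]
unrelated = [[1 for _ in range(n)] for _ in range(n)]

mi_perfect = mutual_information_bits(perfect)
mi_unrelated = mutual_information_bits(unrelated)
print(mi_perfect, mi_unrelated, math.log2(n))
```

With nine equally likely concepts the value is bounded above by log2(9), about 3.17 bits, and it is zero when the recovered concept carries no information about the injected one, so the reported 0.62 and 1.05 bits sit between those two extremes.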

Interpretability & Mechanistic Interp; Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
