This paper investigates the mechanisms behind AI introspection, specifically how models detect injected representations. It replicates and extends the thought injection detection paradigm, revealing two distinct mechanisms: probability-matching (inference from the perceived anomaly of the prompt) and direct access to internal states. The key finding is that the direct access mechanism is content-agnostic: models can detect that an anomaly occurred without reliably identifying its semantic content, so they tend to confabulate high-frequency, concrete concepts instead.
AI models can detect injected thoughts, but they often have no idea *what* those thoughts are, relying on content-agnostic anomaly detection and then guessing common concepts.
Introspection is a foundational cognitive ability, but its mechanism is not well understood. Recent work has shown that AI models can introspect. We study their mechanism of introspection, first extensively replicating the thought injection detection paradigm of Lindsey et al. (2025) in large open-source models. We show that these models detect injected representations via two separable mechanisms: (i) probability-matching (inferring from the perceived anomaly of the prompt) and (ii) direct access to internal states. The direct access mechanism is content-agnostic: models detect that an anomaly occurred but cannot reliably identify its semantic content. The two model classes we study confabulate injected concepts that are high-frequency and concrete (e.g., "apple"); for these models, correct concept guesses typically require significantly more tokens. This content-agnostic introspective mechanism is consistent with leading theories in philosophy and psychology.