Mar 19, 2026 · arXiv:2603.18893

Quantitative Introspection in Language Models: Tracking Internal States Across Conversation

AI Summary

The paper investigates whether LLMs' numeric self-reports can track probe-defined emotive states (wellbeing, interest, focus, impulsivity) during conversations, drawing inspiration from human psychology. The authors operationalize introspection as the causal informational coupling between a model's self-report and a concept-matched probe-defined internal state, and compute self-reports from logits because greedy-decoded outputs collapse to a few uninformative values. Results show that logit-based self-reports track interpretable internal states, evolve through conversation, and can be selectively improved by activation steering. These effects scale with model size in some cases and partially replicate across model families.

Key Contribution

Logit-based numeric self-reports from LLMs track probe-defined internal emotive states across conversations (Spearman $\rho = 0.40$-$0.76$ in LLaMA-3.2-3B-Instruct), positioning numeric self-report as a complementary tool for monitoring these states.

Abstract

Tracking the internal states of large language models across conversations is important for safety, interpretability, and model welfare, yet current methods are limited. Linear probes and other white-box methods compress high-dimensional representations imperfectly and are harder to apply with increasing model size. Taking inspiration from human psychology, where numeric self-report is a widely used tool for tracking internal states, we ask whether LLMs' own numeric self-reports can track probe-defined emotive states over time. We study four concept pairs (wellbeing, interest, focus, and impulsivity) in 40 ten-turn conversations, operationalizing introspection as the causal informational coupling between a model's self-report and a concept-matched probe-defined internal state. We find that greedy-decoded self-reports collapse outputs to a few uninformative values, but introspective capacity can be unmasked by calculating logit-based self-reports. This metric tracks interpretable internal states (Spearman $\rho = 0.40$-$0.76$; isotonic $R^2 = 0.12$-$0.54$ in LLaMA-3.2-3B-Instruct), follows how those states change over time, and activation steering confirms the coupling is causal. Furthermore, we find that introspection is present at turn 1 but evolves through conversation, and can be selectively improved by steering along one concept to boost introspection for another ($\Delta R^2$ up to $0.30$). Crucially, these phenomena scale with model size in some cases, approaching $R^2 \approx 0.93$ in LLaMA-3.1-8B-Instruct, and partially replicate in other model families. Together, these results position numeric self-report as a viable, complementary tool for tracking internal emotive states in conversational AI systems.
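To make the contrast between greedy-decoded and logit-based self-reports concrete, here is a minimal sketch. It assumes, since the abstract does not give the formula, that a logit-based self-report is the probability-weighted mean over candidate rating tokens. The 0-9 rating scale, the function names, and the example logits are all hypothetical and not taken from the paper.

```python
import math

def logit_self_report(rating_logits):
    """Probability-weighted mean rating (an assumed definition, not the paper's).

    rating_logits: dict mapping an integer rating to the model's logit for
    that rating's token at the answer position (hypothetical input).
    """
    ratings = sorted(rating_logits)
    max_logit = max(rating_logits[r] for r in ratings)  # subtract max for stability
    weights = [math.exp(rating_logits[r] - max_logit) for r in ratings]
    total = sum(weights)
    # Softmax restricted to the rating tokens, then take the expectation.
    return sum(r * w / total for r, w in zip(ratings, weights))

def greedy_self_report(rating_logits):
    """Rating whose token has the highest logit (what greedy decoding would emit)."""
    return max(rating_logits, key=rating_logits.get)

# Two made-up turns: the top token is the same, but the remaining mass shifts.
turn_a = {r: 0.0 for r in range(10)}
turn_a[5], turn_a[3] = 2.0, 1.5   # secondary mass below the mode
turn_b = {r: 0.0 for r in range(10)}
turn_b[5], turn_b[7] = 2.0, 1.5   # secondary mass above the mode

print(greedy_self_report(turn_a), greedy_self_report(turn_b))
print(logit_self_report(turn_a), logit_self_report(turn_b))
```

In this toy case both turns decode greedily to the same rating, while the expectation-based value differs between them. That is one way a continuous, logit-derived score could carry information that the argmax discards.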

Eval Frameworks & Benchmarks · Interpretability & Mechanistic Interp · Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 66

Year: 2026

Venue: N/A
