Mar 1, 2026

arXiv:2603.01326

Truth as a Trajectory: What Internal Representations Reveal About Large Language Model Reasoning

Hamed Damirchi, Ignacio Meza De la Jara, Ehsan Abbasnejad, Afshar Shamsi, Javen Shi

AI Summary

The paper introduces Truth as a Trajectory (TaT), an LLM explainability method that models inference as an iterative refinement process and analyzes the geometric displacement of hidden-state representations across layers. TaT addresses a limitation of static activation analysis: because hidden states are saturated with polysemantic features, probes trained on a single layer often capture surface-level lexical patterns. Using only layer-to-layer changes in activations rather than the activations themselves, TaT outperforms conventional probing techniques on benchmarks spanning commonsense reasoning, question answering, and toxicity detection.

Key Contribution

By tracking how LLM activations *move* through layers, this new method reveals the hidden geometry of reasoning, outperforming standard probing techniques that treat activations as static snapshots.

Abstract

Existing explainability methods for Large Language Models (LLMs) typically treat hidden states as static points in activation space, assuming that correct and incorrect inferences can be separated using representations from an individual layer. However, these activations are saturated with polysemantic features, leading to linear probes learning surface-level lexical patterns rather than underlying reasoning structures. We introduce Truth as a Trajectory (TaT), which models the transformer inference as an unfolded trajectory of iterative refinements, shifting analysis from static activations to layer-wise geometric displacement. By analyzing displacement of representations across layers, TaT uncovers geometric invariants that distinguish valid reasoning from spurious behavior. We evaluate TaT across dense and Mixture-of-Experts (MoE) architectures on benchmarks spanning commonsense reasoning, question answering, and toxicity detection. Without access to the activations themselves and using only changes in activations across layers, we show that TaT effectively mitigates reliance on static lexical confounds, outperforming conventional probing, and establishes trajectory analysis as a complementary perspective on LLM explainability.
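To make the shift from static activations to layer-wise displacement concrete, the following is a minimal sketch, not the authors' implementation. It assumes the hidden states for one token position are already available as an array of shape (num_layers + 1, hidden_dim). The function names and the two illustrative features (step length per layer and the cosine between consecutive displacements) are hypothetical choices made for this example; the abstract does not specify which trajectory features TaT uses.

```python
import numpy as np


def layerwise_displacements(hidden_states: np.ndarray) -> np.ndarray:
    """Return the change in representation between consecutive layers.

    hidden_states: array of shape (num_layers + 1, hidden_dim) holding one
    token position's hidden state at each layer (assumed input layout).
    Output has shape (num_layers, hidden_dim).
    """
    return np.diff(hidden_states, axis=0)


def trajectory_features(hidden_states: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Summarize a trajectory using only displacements (illustrative features).

    Returns the per-layer step lengths followed by the cosine similarity
    between each pair of consecutive displacement vectors.
    """
    deltas = layerwise_displacements(hidden_states)
    step_norms = np.linalg.norm(deltas, axis=1)
    unit_steps = deltas / (step_norms[:, None] + eps)
    turn_cosines = np.sum(unit_steps[:-1] * unit_steps[1:], axis=1)
    return np.concatenate([step_norms, turn_cosines])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-in for real hidden states: 5 layer outputs, 4 dimensions.
    states = rng.normal(size=(5, 4))
    features = trajectory_features(states)
    # A constant offset added to every layer does not change the features,
    # because they depend only on differences between layers.
    shifted = trajectory_features(states + 3.0)
    print(features.shape, np.allclose(features, shifted))
```

Features of this kind could then be passed to any standard classifier in place of raw single-layer activations. Because they depend only on differences between layers, any component of the representation that stays constant across layers drops out, which is one way to read the abstract's claim about reducing reliance on static lexical confounds.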

Interpretability & Mechanistic Interp, Reasoning & Chain-of-Thought

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
