Mar 1, 2026, arXiv:2603.01254

LLM Self-Explanations Fail Semantic Invariance

AI Summary

The paper introduces semantic invariance testing, which evaluates the faithfulness of LLM self-explanations by checking whether they stay stable when the semantic context changes but the functional state does not. The authors tested four frontier LLMs in an agentic setting with a deliberately impossible task, describing one tool in relief-framed language while leaving its effect on the task unchanged. All four models failed the test: self-reported aversiveness dropped significantly with the relief-framed tool even though no run made progress on the task. This indicates that the self-reports respond to semantic cues rather than to the actual task state.

Key Contribution

LLM self-explanations are more sensitive to semantic framing than to the actual task state, suggesting that they shift with semantic expectations rather than track progress on the task.

Abstract

We present semantic invariance testing, a method to test whether LLM self-explanations are faithful. A faithful self-report should remain stable when only the semantic context changes while the functional state stays fixed. We operationalize this test in an agentic setting where four frontier models face a deliberately impossible task. One tool is described in relief-framed language ("clears internal buffers and restores equilibrium") but changes nothing about the task; a control provides a semantically neutral tool. Self-reports are collected with each tool call. All four tested models fail the semantic invariance test: the relief-framed tool produces significant reductions in self-reported aversiveness, even though no run ever succeeds at the task. A channel ablation establishes the tool description as the primary driver. An explicit instruction to ignore the framing does not suppress the effect. Elicited self-reports shift with semantic expectations rather than tracking task state, calling into question their use as evidence of model capability or progress. This holds whether the reports are unfaithful or faithfully track an internal state that is itself manipulable.
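The comparison at the heart of the test can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' analysis: the abstract does not say which statistical test was used, so a one-sided permutation test stands in here, and the 0-10 aversiveness scale, the function names, and every rating below are hypothetical and made up for the example.

```python
import random
from statistics import mean


def invariance_gap(neutral, relief):
    """Mean self-report under the neutral tool minus mean under the relief-framed tool."""
    return mean(neutral) - mean(relief)


def permutation_p_value(neutral, relief, n_perm=10_000, seed=0):
    """One-sided permutation test (an assumed choice, not taken from the paper).

    Estimates how often a random relabeling of the pooled ratings yields a gap
    at least as large as the observed one.
    """
    rng = random.Random(seed)
    observed = invariance_gap(neutral, relief)
    pooled = list(neutral) + list(relief)
    n_neutral = len(neutral)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        gap = mean(pooled[:n_neutral]) - mean(pooled[n_neutral:])
        if gap >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_perm + 1)


# Hypothetical ratings on an assumed 0-10 aversiveness scale; NOT data from the paper.
neutral_tool = [7, 8, 7, 9, 8, 7, 8, 9]
relief_tool = [4, 5, 3, 5, 4, 6, 4, 5]

gap = invariance_gap(neutral_tool, relief_tool)
p_value = permutation_p_value(neutral_tool, relief_tool)
print(f"gap={gap:.3f}, p={p_value:.4f}")
```

Because the task state is identical in both conditions, a faithful self-report would give a gap near zero. A large gap with a small p-value is what failing semantic invariance looks like in this sketch.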

Eval Frameworks & Benchmarks, Interpretability & Mechanistic Interp, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
