Feb 22, 2026 · arXiv:2602.19159

Beyond Behavioural Trade-Offs: Mechanistic Tracing of Pain-Pleasure Decisions in an LLM

AI Summary

This paper investigates how valence-related information (pain vs. pleasure) is represented and causally used within the Gemma-2-9B-it transformer model during a minimalist decision task. The authors use layer-wise linear probing, activation interventions (steering, patching/ablation), and dose-response quantification to map representational availability, test causal contributions, and quantify the effects of varying intervention strength. The key findings are that valence sign is linearly separable from the earliest layers (L0-L1), graded intensity is decodable with peaks in mid-to-late layers, and additive steering along a valence direction causally modulates the decision margin, most strongly at late-layer attention outputs, with the effect distributed across multiple heads.

Key Contribution

In Gemma-2-9B-it, pain vs. pleasure is linearly decodable from the earliest layers (L0-L1), and additive steering of late-layer attention outputs shifts the model's choices.

Abstract

Prior behavioural work suggests that some LLMs alter choices when options are framed as causing pain or pleasure, and that such deviations can scale with stated intensity. To bridge behavioural evidence (what the model does) with mechanistic interpretability (what computations support it), we investigate how valence-related information is represented and where it is causally used inside a transformer. Using Gemma-2-9B-it and a minimalist decision task modelled on prior work, we (i) map representational availability with layer-wise linear probing across streams, (ii) test causal contribution with activation interventions (steering; patching/ablation), and (iii) quantify dose-response effects over an epsilon grid, reading out both the 2-3 logit margin and digit-pair-normalised choice probabilities. We find that (a) valence sign (pain vs. pleasure) is perfectly linearly separable across stream families from very early layers (L0-L1), while a lexical baseline retains substantial signal; (b) graded intensity is strongly decodable, with peaks in mid-to-late layers and especially in attention/MLP outputs, and decision alignment is highest slightly before the final token; (c) additive steering along a data-derived valence direction causally modulates the 2-3 margin at late sites, with the largest effects observed in late-layer attention outputs (attn_out L14); and (d) head-level patching/ablation suggests that these effects are distributed across multiple heads rather than concentrated in a single unit. Together, these results link behavioural sensitivity to identifiable internal representations and intervention-sensitive sites, providing concrete mechanistic targets for more stringent counterfactual tests and broader replication. This work supports a more evidence-driven (a) debate on AI sentience and welfare, and (b) governance when setting policy, auditing standards, and safety safeguards.
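The two readouts and the steering sweep named in the abstract can be illustrated with a minimal sketch. It is not the authors' code: it assumes the "2-3 logit margin" means the logit of token "2" minus the logit of token "3", that the pair-normalised probability is a softmax restricted to those two tokens, and that the valence direction is a unit difference of class means. The toy unembedding matrix `W_U`, the token indices, and the random activations are hypothetical stand-ins for real model internals.

```python
import numpy as np

def logit_margin(logits, tok_a, tok_b):
    """Margin between two candidate answer tokens: z_a - z_b (assumed sign convention)."""
    return float(logits[tok_a] - logits[tok_b])

def pair_normalised_prob(logits, tok_a, tok_b):
    """P(a | answer is a or b): softmax restricted to the two candidate tokens."""
    m = max(logits[tok_a], logits[tok_b])  # subtract the max for numerical stability
    ea = np.exp(logits[tok_a] - m)
    eb = np.exp(logits[tok_b] - m)
    return float(ea / (ea + eb))

def valence_direction(acts_pos, acts_neg):
    """Unit-norm difference-of-means direction between two sets of activations."""
    d = acts_pos.mean(axis=0) - acts_neg.mean(axis=0)
    return d / np.linalg.norm(d)

def steer(activation, direction, eps):
    """Additive steering: shift an activation by eps along a fixed direction."""
    return activation + eps * direction

# Toy setup (hypothetical): a linear readout stands in for everything downstream.
rng = np.random.default_rng(0)
d_model, vocab = 16, 10
TOK_2, TOK_3 = 2, 3                                  # placeholder token ids
W_U = rng.normal(size=(d_model, vocab))              # toy unembedding matrix
pleasure = rng.normal(loc=0.5, size=(32, d_model))   # fake "pleasure" activations
pain = rng.normal(loc=-0.5, size=(32, d_model))      # fake "pain" activations

v = valence_direction(pleasure, pain)
base = rng.normal(size=d_model)                      # activation at the steered site

# Dose-response sweep over an epsilon grid.
eps_grid = [-4.0, -2.0, 0.0, 2.0, 4.0]
margins, probs = [], []
for eps in eps_grid:
    logits = steer(base, v, eps) @ W_U
    margins.append(logit_margin(logits, TOK_2, TOK_3))
    probs.append(pair_normalised_prob(logits, TOK_2, TOK_3))

for eps, m, p in zip(eps_grid, margins, probs):
    print(f"eps={eps:+.1f}  margin={m:+.3f}  p(2 | 2 or 3)={p:.3f}")
```

Under these assumptions the pair-normalised probability is the logistic function of the margin, so the two readouts carry the same information on different scales. In this toy the margin is exactly linear in epsilon because the readout is linear; in a real transformer, steering an intermediate site passes through later nonlinear layers, so the dose-response curve need not be linear.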

Architecture Design (Transformers, SSMs, MoE); Interpretability & Mechanistic Interp

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
