Feb 22, 2026 · arXiv:2602.19101

Value Entanglement: Conflation Between Different Kinds of Good In (Some) Large Language Models

Seong Hah Cho, Junyi Li, Anna Leshinskaya

AI Summary

The paper investigates whether LLMs distinguish between moral, grammatical, and economic notions of "good," finding evidence of "value entanglement" where these distinct value representations are conflated. Specifically, grammatical and economic valuations are overly influenced by moral value, deviating from human norms. Selective ablation of morality-associated activation vectors successfully mitigated this conflation, suggesting a potential intervention strategy.

Key Contribution

LLMs don't just have values; some of them mix those values up, letting morality bleed into grammatical and economic judgments in ways that deviate from human norms.

Abstract

Value alignment of Large Language Models (LLMs) requires us to empirically measure these models' actual, acquired representation of value. One characteristic of value representation in humans is that it distinguishes among values of different kinds. We investigate whether LLMs likewise distinguish three different kinds of good: moral, grammatical, and economic. By probing model behavior, embeddings, and residual stream activations, we report pervasive cases of value entanglement: a conflation between these distinct representations of value. Specifically, both grammatical and economic valuation were found to be overly influenced by moral value, relative to human norms. This conflation was repaired by selective ablation of the activation vectors associated with morality.
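The abstract does not say how the morality-associated vectors were obtained or removed, so the following is only a minimal sketch of one common way such an ablation could look, not the authors' method. It assumes a single direction estimated as the difference of mean activations between two sets of examples, and removes it by orthogonal projection. The function names and the toy three-dimensional activations are hypothetical.

```python
from math import sqrt


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def unit_direction(pos, neg):
    """Assumed estimator: normalized difference of class means.

    `pos` and `neg` are hypothetical activation sets, e.g. collected on
    morally good vs. morally bad inputs.
    """
    diff = [p - q for p, q in zip(mean_vector(pos), mean_vector(neg))]
    norm = sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("class means coincide; no direction to ablate")
    return [d / norm for d in diff]


def ablate(activation, direction):
    """Remove the component of `activation` along a unit `direction`."""
    coeff = sum(a * d for a, d in zip(activation, direction))
    return [a - coeff * d for a, d in zip(activation, direction)]


# Toy data: the two classes differ only along the first coordinate.
pos = [[2.0, 1.0, 0.0], [4.0, 1.0, 0.0]]
neg = [[0.0, 1.0, 0.0], [-2.0, 1.0, 0.0]]
direction = unit_direction(pos, neg)
ablated = ablate([5.0, 2.0, -1.0], direction)
```

In this toy case the ablated vector has no remaining component along the estimated direction, while the other coordinates are left unchanged. In a real model the same projection would be applied to residual stream activations at chosen layers, which is a design choice the abstract leaves open.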

Constitutional AI & AI Ethics · Eval Frameworks & Benchmarks · Interpretability & Mechanistic Interp

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
