This paper investigates the relationship between safety fine-tuning, self-attribution of mentality, and Theory of Mind (ToM) in LLMs. Through safety ablation and representational similarity analysis, the authors demonstrate that LLM self-attributions of mind are behaviorally and mechanistically dissociable from ToM capabilities. However, they also find that safety fine-tuned models under-attribute mind to non-human animals and express less spiritual belief, suggesting that safety interventions have unintended consequences for the broader worldviews these models express.
Safety fine-tuning may inadvertently suppress LLMs' attribution of mind to non-human animals and their expression of spiritual belief, even while leaving Theory of Mind intact.
Safety fine-tuning in Large Language Models (LLMs) seeks to suppress potentially harmful forms of mind-attribution such as models asserting their own consciousness or claiming to experience emotions. We investigate whether suppressing mind-attribution tendencies degrades intimately related socio-cognitive abilities such as Theory of Mind (ToM). Through safety ablation and mechanistic analyses of representational similarity, we demonstrate that LLM attributions of mind to themselves and to technological artefacts are behaviorally and mechanistically dissociable from ToM capabilities. Nevertheless, safety fine-tuned models under-attribute mind to non-human animals relative to human baselines and are less likely to exhibit spiritual belief, suppressing widely shared perspectives regarding the distribution and nature of non-human minds.
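The paper's own analysis code is not reproduced here, but the core technique it names, representational similarity analysis, is straightforward to sketch. Below is a minimal illustration in Python, assuming NumPy and SciPy; the function names and the synthetic activation data are hypothetical stand-ins rather than the authors' actual pipeline. The idea: build a representational dissimilarity matrix (RDM) per model from hidden states over a shared prompt set, then compare representational geometries via a Spearman correlation between the RDMs.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(activations: np.ndarray) -> np.ndarray:
    """Representational dissimilarity matrix: pairwise correlation
    distance between activation vectors (one row per stimulus).
    Returned in condensed (upper-triangle) form."""
    return pdist(activations, metric="correlation")

def rsa_score(acts_a: np.ndarray, acts_b: np.ndarray) -> float:
    """Spearman correlation between two models' RDMs over the same
    stimuli. High values mean the two activation sets share a
    similar representational geometry."""
    rho, _ = spearmanr(rdm(acts_a), rdm(acts_b))
    return rho

# Hypothetical data: hidden states for the same 40 mind-attribution
# prompts, extracted from a base model and its safety-tuned variant.
rng = np.random.default_rng(0)
base_acts = rng.standard_normal((40, 768))       # (stimuli, hidden_dim)
tuned_acts = base_acts + 0.1 * rng.standard_normal((40, 768))

print(f"RSA(base, safety-tuned) = {rsa_score(base_acts, tuned_acts):.3f}")
```

In this framing, a dissociation of the kind the abstract describes would appear as RDMs for self-attribution prompts diverging between base and safety-tuned models while RDMs for ToM prompts remain closely aligned.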