Kansai University, RIKEN, Shiga University | Apr 30, 2026 | arXiv:2604.27345

LLMs Capture Emotion Labels, Not Emotion Uncertainty: Distributional Analysis and Calibration of Human-LLM Judgment Gaps

Keito Inoshita, Xiaokang Zhou, Akira Kawai, K. Yada, Katsutoshi Yada

AI Summary

This paper investigates whether LLMs capture the nuanced distribution of human emotion judgments, rather than just majority labels, by comparing LLM-generated emotion distributions to human annotation distributions on GoEmotions and EmoBank. Results show that zero-shot LLMs diverge significantly from human distributions, with in-domain fine-tuning being crucial for bridging this gap, and that LLMs struggle with pragmatically complex emotions lacking explicit lexical markers. The authors also introduce post-hoc calibration methods that reduce the distributional gap by up to 14%.

Key Contribution

LLMs reliably capture emotions with explicit lexical markers, but systematically fail on pragmatically complex emotions requiring contextual inference, revealing a critical limitation in their ability to understand nuanced human emotion.

Abstract

Human annotators frequently disagree on emotion labels, yet most evaluations of Large Language Model (LLM) emotion annotation collapse these judgments into a single gold standard, discarding the distributional information that disagreement encodes. We ask whether LLMs capture the structure of this disagreement, not just majority labels, by comparing emotion judgment distributions between human annotators and four zero-shot LLMs, plus a fine-tuned RoBERTa baseline, across two complementary benchmarks: GoEmotions and EmoBank, totaling 640,000 LLM responses. Zero-shot models diverge substantially from human distributions, and in-domain fine-tuning, not model scale, is required to close the gap. We formalize a lexical-grounding gradient through a quantitative transparency score that predicts per-category human-LLM agreement: LLMs reliably capture emotions with explicit lexical markers but systematically fail on pragmatically complex emotions requiring contextual inference, a pattern that replicates across both categorical and continuous emotion frameworks. We further propose three lightweight post-hoc calibration methods that reduce the distributional gap by up to 14%, and provide actionable guidelines for when LLM emotion annotations can, and cannot, substitute for human labeling.
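To make the idea of a "distributional gap" concrete, the following is a minimal sketch of how one item's human label distribution could be compared with an LLM's repeated responses. The abstract does not name the divergence measure the authors use, so Jensen-Shannon divergence is an assumption here, and the function names, category list, and toy labels are all hypothetical.

```python
import math
from collections import Counter


def label_distribution(labels, categories):
    """Turn a list of labels for one item into a probability vector."""
    if not labels:
        raise ValueError("at least one label is required")
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits (0 = identical, 1 = disjoint).

    Assumed metric for illustration; the paper may use a different one.
    """
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical toy item: annotators disagree, the model does not.
categories = ["joy", "anger", "neutral"]
human = label_distribution(["joy", "joy", "neutral", "anger"], categories)
llm = label_distribution(["joy", "joy", "joy", "joy"], categories)

gap = js_divergence(human, llm)
print(f"human={human} llm={llm} gap={gap:.3f} bits")
```

In this toy case the majority label matches ("joy"), yet the divergence is non-zero because the model's responses carry none of the annotators' disagreement, which is the kind of mismatch a single-gold-label evaluation would miss.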

Eval Frameworks & Benchmarks | Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
