This paper investigates whether LLMs capture the nuanced distribution of human emotion judgments, rather than just majority labels, by comparing LLM-generated emotion distributions to human annotation distributions on GoEmotions and EmoBank. Results show that zero-shot LLMs diverge significantly from human distributions, with in-domain fine-tuning being crucial for bridging this gap, and that LLMs struggle with pragmatically complex emotions lacking explicit lexical markers. The authors also introduce post-hoc calibration methods that reduce the distributional gap by up to 14%.
LLMs reliably capture emotions with explicit lexical markers, but systematically fail on pragmatically complex emotions requiring contextual inference, revealing a critical limitation in their ability to understand nuanced human emotion.
Human annotators frequently disagree on emotion labels, yet most evaluations of Large Language Model (LLM) emotion annotation collapse these judgments into a single gold standard, discarding the distributional information that disagreement encodes. We ask whether LLMs capture the structure of this disagreement, not just majority labels, by comparing emotion judgment distributions between human annotators and four zero-shot LLMs, plus a fine-tuned RoBERTa baseline, across two complementary benchmarks: GoEmotions and EmoBank, totaling 640,000 LLM responses. Zero-shot models diverge substantially from human distributions, and in-domain fine-tuning, not model scale, is required to close the gap. We formalize a lexical-grounding gradient through a quantitative transparency score that predicts per-category human-LLM agreement: LLMs reliably capture emotions with explicit lexical markers but systematically fail on pragmatically complex emotions requiring contextual inference, a pattern that replicates across both categorical and continuous emotion frameworks. We further propose three lightweight post-hoc calibration methods that reduce the distributional gap by up to 14%, and provide actionable guidelines for when LLM emotion annotations can, and cannot, substitute for human labeling.
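To make the idea of a "distributional gap" concrete, the sketch below compares a human label distribution with an LLM label distribution for a single text. It is an illustration only: the abstract does not say which divergence measure the authors use, so Jensen-Shannon divergence is an assumption here, and the function names, the four-category label set, and the toy annotations are all hypothetical.

```python
import math
from collections import Counter


def label_distribution(labels, categories):
    """Convert a list of categorical labels into a probability vector over `categories`."""
    counts = Counter(labels)
    total = sum(counts[c] for c in categories)
    if total == 0:
        raise ValueError("no labels fall in the given categories")
    return [counts[c] / total for c in categories]


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical distributions, 1 for disjoint ones."""
    # Mixture distribution; it is positive wherever p or q is positive.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x_i == 0 contribute nothing, by the usual convention.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


# Hypothetical label set and toy annotations (not taken from GoEmotions or EmoBank).
categories = ["joy", "anger", "sadness", "neutral"]
human = label_distribution(["joy", "joy", "neutral", "sadness"], categories)
llm = label_distribution(["joy", "joy", "joy", "joy"], categories)

gap = js_divergence(human, llm)
print(f"JS divergence between human and LLM labels: {gap:.3f}")
```

In this toy case the LLM agrees with the human majority label ("joy") but still shows a non-zero divergence, because it assigns no mass to the minority judgments. That is the kind of information a single gold label discards.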