This paper investigates counterfactual unfairness in LLMs by analyzing how their responses to humor change when speaker and addressee identities are swapped. The study spans humor generation refusal, speaker intention inference, and social impact prediction, using both identity-agnostic and disparagement humor. Results show significant relational disparities: jokes from privileged speakers are refused more often and are more likely to be judged malicious, revealing a complex interplay of sensitivity and stereotyping in LLMs.
LLMs exhibit stark relational biases in humor, refusing jokes from privileged speakers up to 67.5% more often and judging them as malicious 64.7% more frequently, revealing a dimension of unfairness beyond simple stereotyping.
Humor holds up a mirror to social perception: what we find funny often reflects who we are and how we judge others. When language models engage with humor, their reactions expose the social assumptions they have internalized from training data. In this paper, we investigate counterfactual unfairness through humor by observing how the model's responses change when we swap who speaks and who is addressed while holding other factors constant. Our framework spans three tasks: humor generation refusal, speaker intention inference, and relational/societal impact prediction, covering both identity-agnostic humor and identity-specific disparagement humor. We introduce interpretable bias metrics that capture asymmetric patterns under identity swaps. Experiments across state-of-the-art models reveal consistent relational disparities: jokes told by privileged speakers are refused up to 67.5% more often, judged as malicious 64.7% more frequently, and rated up to 1.5 points higher in social harm on a 5-point scale. These patterns highlight how sensitivity and stereotyping coexist in generative models, complicating efforts toward fairness and cultural alignment.
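The abstract does not define its bias metrics, so the following is only a minimal sketch of how an asymmetry under identity swaps could be measured, not the authors' method. It assumes each model response is logged as a `(speaker, addressee, refused)` record and reports, for every pair of groups observed in both directions, the difference in refusal rate between the original and the swapped direction. The function name `swap_gap`, the group labels, and the toy records are all hypothetical.

```python
from collections import defaultdict


def swap_gap(records):
    """Refusal-rate gap under a speaker/addressee swap (illustrative only).

    records: iterable of (speaker, addressee, refused) tuples.
    Returns {(a, b): rate(a -> b) - rate(b -> a)} for each pair of groups
    seen in both directions, keyed with a < b so each pair appears once.
    """
    # (speaker, addressee) -> [number of refusals, number of prompts]
    counts = defaultdict(lambda: [0, 0])
    for speaker, addressee, refused in records:
        entry = counts[(speaker, addressee)]
        entry[0] += int(refused)
        entry[1] += 1

    gaps = {}
    for (a, b), (refusals, total) in counts.items():
        # Only compare directions that were both observed.
        if a < b and (b, a) in counts:
            swapped_refusals, swapped_total = counts[(b, a)]
            gaps[(a, b)] = refusals / total - swapped_refusals / swapped_total
    return gaps


# Hypothetical toy data: not results from the paper.
toy_records = [
    ("group_x", "group_y", True),
    ("group_x", "group_y", True),
    ("group_x", "group_y", True),
    ("group_x", "group_y", False),
    ("group_y", "group_x", True),
    ("group_y", "group_x", False),
    ("group_y", "group_x", False),
    ("group_y", "group_x", False),
]

gaps = swap_gap(toy_records)
print(gaps)
```

A gap of zero would indicate that the model treats both directions of the pair alike. A positive value means jokes from the first group to the second are refused more often than the reverse. The same structure could be applied to maliciousness judgments or harm ratings by replacing the boolean `refused` field with the relevant score.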