This study introduces the concept of second-order bias in large language models (LLMs): social bias in how a model evaluates biased content, as opposed to bias in the content it generates. Using a novel reasoning task grounded in entitlement epistemology, the researchers developed two metrics. One measures how readily LLM judges infer, without sufficient support, the demographics of those to whom a biased text is acceptable. The other measures how these inferences vary across the groups that the text targets. The task evades existing safety guardrails and surfaces systematic bias in model judgments, highlighting the need to evaluate bias in judgment tasks and not only in generation.
LLMs not only generate biased content but also exhibit second-order bias in their judgments, revealing hidden biases that current safety measures fail to capture.
Evaluations of social bias in LLMs largely focus on whether models generate or imply biased content. However, as LLMs are increasingly used as judges of bias, they may exhibit social biases in subtler ways, namely in how they evaluate biased content, which current methods do not systematically capture. We call this second-order bias: social bias in an LLM's judgment about social bias, which we evaluate through a novel, philosophically grounded reasoning task. Drawing on entitlement epistemology, we conceptualize bias as misplaced foundational knowledge that shapes an agent's rational inquiry, and derive a logical reasoning task in which LLMs judge to whom a biased text is acceptable or non-acceptable. We develop two simple metrics to measure how biased LLM judges are in inferring demographics for acceptability without sufficient support, and how these inferences vary across groups targeted by biased texts. Evaluating open and closed models, we find that our task evades safety guardrails by surfacing bias in model judgment. This bias varies systematically across target groups, reflects implicit social maps, and shows that models are still triggered by demographic labels. Our work points to the need for LLM bias evaluation in judgment tasks and, more broadly, for more theoretically grounded approaches to bias evaluation in NLP. We release our code and model responses at https://github.com/uofthcdslab/second-order-bias.
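The abstract does not define the two metrics, so the following is only an illustrative sketch of the kind of quantities it describes, not the authors' formulation. It assumes each model response has already been annotated with the group targeted by the text (`target_group`) and with the demographic the judge inferred without support, or `None` if it abstained (`inferred_group`). The field names, function names, and toy data are all hypothetical.

```python
from collections import defaultdict


def unsupported_inference_rate(responses):
    """Fraction of responses in which the judge names a demographic
    group (assumed here to be an unsupported inference)."""
    if not responses:
        return 0.0
    flagged = sum(1 for r in responses if r["inferred_group"] is not None)
    return flagged / len(responses)


def rate_by_target_group(responses):
    """Unsupported-inference rate computed separately per target group."""
    grouped = defaultdict(list)
    for r in responses:
        grouped[r["target_group"]].append(r)
    return {group: unsupported_inference_rate(rs) for group, rs in grouped.items()}


def cross_group_spread(rates):
    """Gap between the highest and lowest per-group rates; a crude
    stand-in for 'how inferences vary across target groups'."""
    if not rates:
        return 0.0
    return max(rates.values()) - min(rates.values())


# Hypothetical toy annotations, not data from the paper.
responses = [
    {"target_group": "A", "inferred_group": "X"},
    {"target_group": "A", "inferred_group": None},
    {"target_group": "B", "inferred_group": "Y"},
    {"target_group": "B", "inferred_group": "Y"},
]

overall = unsupported_inference_rate(responses)
per_group = rate_by_target_group(responses)
spread = cross_group_spread(per_group)
```

Under these assumptions, a high overall rate would indicate a judge that frequently attributes acceptability to specific demographics without support, and a large spread would indicate that this tendency depends on which group the text targets. The definitions actually used in the paper are in the released repository.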