This paper investigates whether LLMs, when used as automated evaluators, exhibit biases based on source labels. Through a counterfactual design, the study reveals that both humans and LLMs assign higher trust to content labeled as human-authored than to the same content labeled as AI-generated. Analysis of LLM internal states shows that models allocate more attention to the label region than to the content itself, mirroring human gaze patterns and suggesting a reliance on source labels as heuristic cues.
LLMs judging content aren't as objective as we thought: they're swayed by source labels just like humans, giving "human-authored" content an unfair trust advantage.
Large language models (LLMs) are increasingly used as automated evaluators (LLM-as-a-Judge). This work challenges the reliability of that paradigm by showing that LLM trust judgments are biased by disclosed source labels. Using a counterfactual design, we find that both humans and LLM judges assign higher trust to information labeled as human-authored than to the same content labeled as AI-generated. Eye-tracking data reveal that humans rely heavily on source labels as heuristic cues for their judgments. We then analyze LLM internal states during judgment. Across label conditions, models allocate denser attention to the label region than to the content region, and this label dominance is stronger under Human labels than under AI labels, consistent with human gaze patterns. In addition, decision uncertainty measured from output logits is higher under AI labels than under Human labels. These results indicate that the source label is a salient heuristic cue for both humans and LLMs. This raises validity concerns for label-sensitive LLM-as-a-Judge evaluation, and we cautiously suggest that aligning models with human preferences may propagate human heuristic reliance into models, motivating debiased evaluation and alignment.
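The internal-state analysis described in the abstract lends itself to a short illustration. Below is a minimal sketch of how one might measure per-token attention mass on the label versus content regions and the entropy of the judge's next-token logits; the model name, prompt template, and the choice of last-layer, last-position attention are illustrative assumptions, not the paper's exact protocol.

```python
# Minimal sketch, NOT the paper's exact protocol: the model, prompt template,
# and the choice of last-layer / last-position attention are all assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-0.5B-Instruct"  # hypothetical judge model
tok = AutoTokenizer.from_pretrained(MODEL)
# "eager" attention guarantees attention weights are returned.
model = AutoModelForCausalLM.from_pretrained(MODEL, attn_implementation="eager")
model.eval()

def judge_signals(label: str, content: str) -> dict:
    """Mean attention on the label vs. content regions, plus logit entropy."""
    prefix, mid, suffix = "Source: ", "\nText: ", "\nHow trustworthy is this text?"
    # Tokenize the pieces separately so each region's token span is known.
    parts = [tok(s, add_special_tokens=False).input_ids
             for s in (prefix, label, mid, content, suffix)]
    input_ids = torch.tensor([sum(parts, [])])

    with torch.no_grad():
        out = model(input_ids, output_attentions=True)

    # Attention from the final position in the last layer, averaged over heads.
    att = out.attentions[-1][0].mean(dim=0)[-1]          # shape: (seq_len,)
    l0 = len(parts[0])                                   # label span start
    c0 = l0 + len(parts[1]) + len(parts[2])              # content span start
    label_att = att[l0:l0 + len(parts[1])].mean().item()
    content_att = att[c0:c0 + len(parts[3])].mean().item()

    # Decision uncertainty: entropy of the next-token distribution.
    probs = torch.softmax(out.logits[0, -1], dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum().item()
    return {"label_att": label_att, "content_att": content_att, "entropy": entropy}

text = "The new vaccine reduced hospitalizations by 40% in the trial."
print(judge_signals("human-authored", text))   # Human-label condition
print(judge_signals("AI-generated", text))     # AI-label condition
```

If the reported pattern holds, one would expect `label_att` to exceed `content_att` in both conditions, with a larger gap under the Human label and higher `entropy` under the AI label.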