Michigan State | UQ | Feb 19, 2026 | arXiv:2602.17170

When LLM Judges Inflate Scores: Exploring Overrating in Relevance Assessment

Chuting Yu, Hang Li, Joel Mackenzie, Teerapong Leelanupab

AI Summary

The paper investigates whether LLMs can reliably substitute for human judges in information retrieval relevance assessment, focusing on overrating behavior. Through systematic experiments varying model backbones, evaluation paradigms (pointwise and pairwise), and passage modification strategies, the authors show that LLMs consistently assign inflated relevance scores to passages that do not satisfy the underlying information need, which points to a systematic bias rather than random fluctuation. Controlled experiments further show that the judgments can be highly sensitive to passage length and surface-level lexical cues, calling into question the use of LLMs as drop-in replacements for human assessors.

Key Contribution

LLMs aren't ready to replace human judges in relevance assessment, as they consistently inflate relevance scores and are easily swayed by superficial cues like passage length.

Abstract

Human relevance assessment is time-consuming and cognitively intensive, limiting the scalability of Information Retrieval evaluation. This has led to growing interest in using large language models (LLMs) as proxies for human judges. However, it remains an open question whether LLM-based relevance judgments are reliable, stable, and rigorous enough to match humans for relevance assessment. In this work, we conduct a systematic study of overrating behavior in LLM-based relevance judgments across model backbones, evaluation paradigms (pointwise and pairwise), and passage modification strategies. We show that models consistently assign inflated relevance scores -- often with high confidence -- to passages that do not genuinely satisfy the underlying information need, revealing a system-wide bias rather than random fluctuations in judgment. Furthermore, controlled experiments show that LLM-based relevance judgments can be highly sensitive to passage length and surface-level lexical cues. These results raise concerns about the usage of LLMs as drop-in replacements for human relevance assessors, and highlight the urgent need for careful diagnostic evaluation frameworks when applying LLMs for relevance assessments. Our code and results are publicly available.
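As a rough illustration of what "overrating" means in practice, the sketch below compares paired graded relevance labels from a human assessor and an LLM judge for the same query-passage pairs. It is not the authors' code: the function name `overrating_stats`, the 0-3 grading scale, and the label values are all assumptions made up for this example.

```python
from statistics import mean


def overrating_stats(human, llm):
    """Summarize how often LLM labels exceed human labels.

    Both inputs are lists of graded relevance labels (assumed 0-3 here)
    for the same query-passage pairs, in the same order.
    """
    if len(human) != len(llm) or not human:
        raise ValueError("need two non-empty label lists of equal length")

    # Positive difference: the LLM rated the passage higher than the human.
    diffs = [l - h for h, l in zip(human, llm)]
    n = len(diffs)
    return {
        "overrate_rate": sum(d > 0 for d in diffs) / n,
        "underrate_rate": sum(d < 0 for d in diffs) / n,
        "mean_signed_diff": mean(diffs),
    }


# Hypothetical labels, for illustration only (not data from the paper).
human = [0, 0, 1, 2, 3, 0]
llm = [1, 2, 1, 3, 3, 0]
stats = overrating_stats(human, llm)
print(stats)
```

If disagreements were random noise, one would expect overrating and underrating to occur at roughly similar rates and the mean signed difference to sit near zero; a large gap between the two rates is the kind of one-sided skew the abstract describes as a system-wide bias.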

Eval Frameworks & Benchmarks | Natural Language Processing | Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
