Southwest UMar 11, 2026arXiv:2603.11027

Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge

AI Summary

This paper investigates the reliability of LLM-as-a-judge paradigms, revealing a phenomenon called "Evaluation Illusion" where high model-level agreement masks significant inconsistencies at the sample level due to reliance on surface heuristics. They demonstrate that shared rubric structure significantly inflates agreement scores and that high-quality outputs receive the least consistent evaluations. To address this, they introduce MERG, a framework for dynamically generating knowledge-grounded evaluation rubrics, showing that agreement increases in knowledge-rich domains and decreases in subjective domains, suggesting the need for expert knowledge in evaluation rubrics.

Key Contribution

LLM-as-a-judge consensus is often an illusion: models agree on surface-level features, but diverge wildly when evaluating true quality, a problem fixable by injecting domain knowledge into rubrics.

Abstract

The paradigm of LLM-as-a-judge relies on a critical assumption, namely that high inter-evaluator agreement indicates reliable and objective evaluation. We present two complementary findings that challenge this assumption. \textbf{First}, we demonstrate that this consensus is frequently illusory. We identify and formalize \textbf{Evaluation Illusion}, a phenomenon where LLM judges generate sophisticated critiques yet anchor scores on shared surface heuristics rather than substantive quality. Through a large-scale study of 105,600 evaluation instances (32 LLMs $\times$ 3 frontier judges $\times$ 100 tasks $\times$ 11 temperatures), we show that model-level agreement (Spearman $\rho = 0.99$) masks fragile sample-level agreement (Pearson $\bar{r} = 0.72$; absolute agreement ICC $= 0.67$), that merely sharing rubric structure restores 62\% of total agreement, and that high-quality outputs paradoxically receive the \textit{least} consistent evaluations. \textbf{Second}, we demonstrate that dynamically generating evaluation rubrics grounded in domain knowledge produces more meaningful assessment. We introduce MERG (Metacognitive Enhanced Rubric Generation), a knowledge-driven rubric generation framework whose domain-selective effects confirm this. Agreement \textit{increases} in codified domains (Education +22\%, Academic +27\%) where knowledge anchors evaluators on shared standards, while it decreases in subjective domains where genuine evaluative pluralism emerges. These findings suggest that evaluation rubrics should be dynamically enriched with expert knowledge rather than relying on generic criteria, with implications for reward modeling in RLAIF.

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks RLHF & Preference Learning

Citation Metrics

Citations0

Influential citations0

References38

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Beyond the Illusion of Consensus: From Surface Heuristics to Knowledge-Grounded Evaluation in LLM-as-a-Judge

Related Papers