This paper introduces a diagnostic toolkit for evaluating the reliability of LLM-as-judge frameworks in NLG evaluation, combining transitivity analysis with split conformal prediction sets. The authors find widespread per-input inconsistency in LLM judgments that low aggregate violation rates mask, and show that prediction set width serves as a per-instance indicator of judgment reliability. Across four judges and four criteria, relevance is judged most reliably, coherence moderately so, and fluency and consistency least reliably.
LLM judges are far less reliable on individual examples than aggregate metrics suggest: 33-67% of documents contain at least one intransitive judgment cycle, and for fluency and consistency the prediction sets cover nearly the entire 1-5 scale.
LLM-as-judge frameworks are increasingly used for automatic NLG evaluation, yet their per-instance reliability remains poorly understood. We present a two-pronged diagnostic toolkit applied to SummEval: $\textbf{(1)}$ a transitivity analysis that reveals widespread per-input inconsistency masked by low aggregate violation rates ($\bar{\rho} = 0.8$-$4.1\%$), with $33$-$67\%$ of documents exhibiting at least one directed 3-cycle; and $\textbf{(2)}$ split conformal prediction sets over 1-5 Likert scores providing theoretically-guaranteed $\geq(1{-}\alpha)$ coverage, with set width serving as a per-instance reliability indicator ($r_s = {+}0.576$, $N{=}1{,}918$, $p<10^{-100}$, pooled across all judges). Critically, prediction set width shows consistent cross-judge agreement ($\bar{r} = 0.32$-$0.38$), demonstrating it captures document-level difficulty rather than judge-specific noise. Across four judges and four criteria, both diagnostics converge: criterion matters more than judge, with relevance judged most reliably (avg. set size $\approx 3.0$) and coherence moderately so (avg. set size $\approx 3.9$), while fluency and consistency remain unreliable (avg. set size $\approx 4.9$). We release all code, prompts, and cached results.
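To make the two diagnostics concrete, here is a minimal Python sketch of both ideas. It is not the authors' released code. The function names, the representation of pairwise preferences as `(winner, loser)` pairs, the toy data, and the choice of absolute residual between judge and human Likert scores as the nonconformity score are all assumptions made for illustration. The paper may define its violation rate and conformal score differently.

```python
import math
from itertools import combinations


def count_directed_3cycles(items, prefers):
    """Count item triples whose pairwise preferences form a directed cycle.

    `prefers` is a set of (winner, loser) pairs (assumed representation).
    """
    cycles = 0
    for a, b, c in combinations(items, 3):
        clockwise = {(a, b), (b, c), (c, a)} <= prefers
        counter = {(b, a), (c, b), (a, c)} <= prefers
        if clockwise or counter:
            cycles += 1
    return cycles


def conformal_quantile(residuals, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration residuals."""
    n = len(residuals)
    rank = math.ceil((n + 1) * (1 - alpha))  # 1-indexed order statistic
    if rank > n:
        # Too few calibration points for this alpha: only the trivial
        # (full-scale) set keeps the coverage guarantee.
        return math.inf
    return sorted(residuals)[rank - 1]


def prediction_set(judge_score, q_hat, labels=(1, 2, 3, 4, 5)):
    """All Likert labels within q_hat of the judge's score."""
    return [k for k in labels if abs(k - judge_score) <= q_hat]


# Toy transitivity check: s1 > s2 > s3 > s1 is a cycle; s4 loses to everyone.
summaries = ["s1", "s2", "s3", "s4"]
prefs = {("s1", "s2"), ("s2", "s3"), ("s3", "s1"),
         ("s1", "s4"), ("s2", "s4"), ("s3", "s4")}
n_cycles = count_directed_3cycles(summaries, prefs)    # 1
violation_rate = n_cycles / math.comb(len(summaries), 3)  # 1 of 4 triples

# Toy calibration split: judge scores paired with human scores (invented data).
cal_judge = [4, 3, 5, 2, 4, 3, 5, 4, 2, 3]
cal_human = [4, 4, 4, 2, 3, 3, 5, 5, 1, 3]
residuals = [abs(h - j) for h, j in zip(cal_human, cal_judge)]
q_hat = conformal_quantile(residuals, alpha=0.2)       # 1

# Set for a new judge score of 4; its size is the per-instance width.
new_set = prediction_set(4, q_hat)                     # [3, 4, 5]
print(n_cycles, violation_rate, q_hat, new_set, len(new_set))
```

In this sketch a single calibrated threshold gives every interior score the same set size, so it cannot reproduce the per-instance variation in width reported in the abstract. That variation would need a score that depends on the input, for example one built from the judge's probability distribution over the five labels, which is beyond what the abstract specifies.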