ELLIS | Jun 11, 2026 | arXiv:2606.13221

From Uncertain Judgments to Calibrated Rankings: Conformal Elo Estimation for LLM Evaluation

AI Summary

This paper proposes calibrated Elo estimation for evaluating large language models (LLMs), addressing systematic errors in rankings derived from LLM judges. It works at two levels: locally, calibrated win probabilities capture per-battle uncertainty; globally, split conformal prediction produces prediction intervals around the resulting ratings. The local step alone brings LLM-derived ratings to a mean absolute error of 17.9 Elo points relative to human-derived ratings, averaged over 55 held-out models on LMArena. The method offers a lower-cost alternative to large-scale human annotation, giving developers calibrated rankings together with uncertainty estimates.

Key Contribution

LLM-derived Elo ratings can closely track human-derived ones (17.9 Elo MAE over 55 held-out models) at a fraction of the cost of human annotation, using a method that quantifies and calibrates the uncertainty in judge evaluations.

Abstract

Evaluating new large language models typically requires costly human annotation campaigns at scale. LLM-as-a-judge offers a cheaper alternative, but judge scores carry systematic errors (such as position bias, self-preference, or intransitivity) that can strongly miscalibrate the resulting rankings. We quantify the resulting judge-human disagreement at two complementary levels. At the local level, we estimate per-battle uncertainty from the judge's own score differences by propagating calibrated win probabilities rather than hard labels into the Bradley-Terry procedure. This alone provides a drastic improvement to Elo estimation accuracy, bringing LLM-derived ratings within 17.9 Elo MAE of human-derived ones when averaged over 55 held-out models on LMArena. At the global level, we apply split conformal prediction to the residual gap between LLM-derived and human-derived Elo ratings across held-out models, producing prediction intervals with distribution-free marginal coverage guarantees that account for irreducible LLM-human disagreement. Together, these two layers yield a low-cost evaluation tool that provides developers with calibrated Elo estimates and honest uncertainty bounds, without access to large-scale human annotations. To facilitate reproducibility, we release our code at https://github.com/kargibora/SoftElo.
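The two layers described in the abstract can be illustrated with a minimal sketch. This is not the authors' released code: the function names, the logistic calibration map with a `slope` parameter, the gradient-ascent fit, and the 1000-point anchor are all assumptions made here for illustration. The first function fits Bradley-Terry strengths from soft win probabilities instead of hard labels; the last computes a standard split conformal half-width from absolute residuals between LLM-derived and human-derived Elo on calibration models.

```python
import math


def win_probability(score_diff, slope=1.0):
    # Assumed calibration map: logistic function of the judge's score
    # difference. The slope is a hypothetical parameter that would be
    # fitted on battles with known human outcomes.
    return 1.0 / (1.0 + math.exp(-slope * score_diff))


def soft_bradley_terry(battles, n_models, lr=0.1, steps=2000):
    """Fit Bradley-Terry strengths from soft labels by gradient ascent.

    battles: list of (i, j, p), where p is the probability that model i
    beats model j. Returns ratings on an Elo-like scale whose mean is
    1000 (an arbitrary anchor chosen for this sketch).
    """
    theta = [0.0] * n_models
    for _ in range(steps):
        grad = [0.0] * n_models
        for i, j, p in battles:
            pred = 1.0 / (1.0 + math.exp(-(theta[i] - theta[j])))
            # Gradient of the soft-label log-likelihood
            # p*log(pred) + (1-p)*log(1-pred) with respect to theta[i].
            grad[i] += p - pred
            grad[j] -= p - pred
        theta = [t + lr * g / len(battles) for t, g in zip(theta, grad)]
    mean = sum(theta) / n_models
    scale = 400.0 / math.log(10.0)  # natural-log strengths to Elo points
    return [1000.0 + scale * (t - mean) for t in theta]


def conformal_halfwidth(residuals, alpha=0.1):
    """Split conformal quantile of |Elo_LLM - Elo_human| residuals.

    The interval [estimate - q, estimate + q] then has marginal coverage
    of at least 1 - alpha for an exchangeable new model.
    """
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return math.inf  # too few calibration models for this level
    return sorted(abs(r) for r in residuals)[k - 1]


# Toy usage with made-up judge score differences (not data from the paper).
toy_battles = [(0, 1, win_probability(1.2)), (1, 2, win_probability(0.4)),
               (0, 2, win_probability(2.0))]
ratings = soft_bradley_terry(toy_battles, n_models=3)
q = conformal_halfwidth([5, -10, 20, -15, 30, 8, -12, 25, 18, -3], alpha=0.2)
print([round(r, 1) for r in ratings], "half-width:", q)
```

Passing probabilities rather than 0/1 outcomes means a near-tie contributes almost no gradient, which is one plausible reading of how per-battle uncertainty propagates into the ratings. The conformal step is independent of how the ratings were fitted; it only needs held-out models with both LLM-derived and human-derived Elo.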

Eval Frameworks & Benchmarks | RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 43

Year: 2026

Venue: N/A
