UIUC · Apr 28, 2026 · arXiv:2604.25235

VLM Judges Can Rank but Cannot Score: Task-Dependent Uncertainty in Multimodal Evaluation

Divake Kumar, Sina Tayebati, Devashri Naik, Ranganath Krishnan, A. Trivedi

AI Summary

This paper investigates how reliable Vision-Language Models (VLMs) are as automated judges for multimodal systems. It applies conformal prediction to convert a judge's point score into a calibrated prediction interval, using only score-token log-probabilities and no retraining. Across 3 judges and 14 visual task categories, evaluation uncertainty turns out to be strongly task-dependent: intervals cover roughly 40% of the score range for aesthetics and natural images but widen to roughly 70% for chart and mathematical reasoning. The authors also identify a ranking-scoring decoupling failure mode, in which a judge orders responses correctly (high ranking correlation) while its intervals are too wide to support reliable absolute scores. Interval width is driven primarily by task difficulty and annotation quality rather than by the judge or the method.
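The ranking-scoring decoupling described above can be made concrete with a toy example. The sketch below is not the paper's code: the scores are invented for illustration, and the helper names `ranks`, `spearman`, and `mean_abs_error` are hypothetical. It shows a judge whose scores are a compressed, shifted copy of the human scores, so the ordering is perfect while the absolute values are far off.

```python
import math

def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            out[order[k]] = avg
        i = j + 1
    return out

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors.
    Assumes neither input is constant (otherwise the denominator is zero)."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def mean_abs_error(x, y):
    """Average absolute gap between two score lists of equal length."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

# Invented scores on an assumed 1-5 scale, for illustration only.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
judge = [3.0, 3.2, 3.4, 3.6, 3.8]  # same ordering, compressed and shifted

rho = spearman(human, judge)        # ranking agreement
mae = mean_abs_error(human, judge)  # absolute-score disagreement
print(f"Spearman rho = {rho:.2f}, mean absolute error = {mae:.2f}")
```

A ranking metric alone would rate this judge as excellent, which is why a separate measure of absolute-score reliability, such as interval width, is informative.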

Key Contribution

VLM judges can rank responses correctly yet assign unreliable absolute scores, a failure mode that standard evaluation metrics do not capture.

Abstract

Vision-language models (VLMs) are increasingly used as automated judges for multimodal systems, yet their scores provide no indication of reliability. We study this problem through conformal prediction, a distribution-free framework that converts a judge's point score into a calibrated prediction interval using only score-token log-probabilities, with no retraining. We present the first systematic analysis of conformal prediction for VLM-as-a-Judge across 3 judges and 14 visual task categories. Our results show that evaluation uncertainty is strongly task-dependent: intervals cover ~40% of the score range for aesthetics and natural images but expand to ~70% for chart and mathematical reasoning, yielding a quantitative reliability map for multimodal evaluation. We further identify a failure mode not captured by standard evaluation metrics, ranking-scoring decoupling, where judges achieve high ranking correlation while producing wide, uninformative intervals, correctly ordering responses but failing to assign reliable absolute scores. Finally, we show that interval width is driven primarily by task difficulty and annotation quality, i.e., the same judge and method yield 4.5x narrower intervals on a clean, multi-annotator captioning benchmark. Code: https://github.com/divake/VLM-Judge-Uncertainty
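To make the idea of turning a point score into a calibrated interval concrete, here is a minimal sketch of split conformal prediction with absolute-residual nonconformity scores. It is an illustration under stated assumptions, not the authors' implementation (see the linked repository for that): the 1-5 score scale, the use of a probability-weighted mean over score tokens as the point score, and all function names are assumptions introduced here.

```python
import math

def expected_score(logprobs):
    """Probability-weighted mean score from score-token log-probabilities.

    logprobs: dict mapping each candidate score (e.g. 1..5) to its log-probability.
    The values are renormalised, so they need not sum to one.
    """
    m = max(logprobs.values())  # subtract the max for numerical stability
    weights = {s: math.exp(lp - m) for s, lp in logprobs.items()}
    total = sum(weights.values())
    return sum(s * w for s, w in weights.items()) / total

def conformal_quantile(residuals, alpha):
    """Finite-sample-corrected quantile of calibration residuals |human - judge|.

    Returns the k-th smallest residual with k = ceil((n + 1) * (1 - alpha)),
    or infinity when the calibration set is too small for the requested level.
    """
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(residuals)[k - 1]

def prediction_interval(point, q, lo=1.0, hi=5.0):
    """Symmetric interval around the point score, clipped to the score range."""
    return max(lo, point - q), min(hi, point + q)

def relative_width(interval, lo=1.0, hi=5.0):
    """Interval width as a fraction of the full score range."""
    return (interval[1] - interval[0]) / (hi - lo)

# Hypothetical calibration residuals and one test item, for illustration only.
calibration_residuals = [0.5, 1.5, 1.0]
q = conformal_quantile(calibration_residuals, alpha=0.5)
point = expected_score({1: math.log(0.5), 5: math.log(0.5)})
interval = prediction_interval(point, q)
print(interval, relative_width(interval))
```

Under exchangeability of calibration and test items, an interval built this way contains the human score with probability at least 1 - alpha, regardless of how good the judge is. A weaker judge or a harder task shows up as larger residuals and therefore wider intervals, which is the quantity the abstract reports as a fraction of the score range.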

Computer Vision · Eval Frameworks & Benchmarks · Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
