Harvard, RIKEN, Sheffield, Tohoku | Apr 13, 2026 | arXiv:2604.11662

Hidden Failures in Robustness: Why Supervised Uncertainty Quantification Needs Better Evaluation

Joe Stacey, Hadas Orgad, Kentaro Inui, Benjamin Heinzerling, Nafise Sadat Moosavi

AI Summary

This paper benchmarks the robustness of supervised uncertainty probes for LLMs across diverse models, tasks, and out-of-distribution (OOD) settings. It finds that current probes are not robust, especially for long-form generations, and that robustness depends less on architecture than on the probe inputs. Middle-layer representations and features aggregated across response tokens yield more robust uncertainty estimates under distribution shift than final-layer hidden states or single-token features. The authors also explore a simple hybrid back-off strategy to improve robustness.

Key Contribution

Uncertainty estimates from LLMs can crumble under distribution shift, but the right probe design – think middle layers and token aggregation – can make them surprisingly resilient.

Abstract

Recent work has shown that the hidden states of large language models contain signals useful for uncertainty estimation and hallucination detection, motivating a growing interest in efficient probe-based approaches. Yet it remains unclear how robust existing methods are, and which probe designs provide uncertainty estimates that are reliable under distribution shift. We present a systematic study of supervised uncertainty probes across models, tasks, and OOD settings, training over 2,000 probes while varying the representation layer, feature type, and token aggregation strategy. Our evaluation highlights poor robustness in current methods, particularly in the case of long-form generations. We also find that probe robustness is driven less by architecture and more by the probe inputs. Middle-layer representations generalise more reliably than final-layer hidden states, and aggregating across response tokens is consistently more robust than relying on single-token features. These differences are often largely invisible in-distribution but become more important under distribution shift. Informed by our evaluation, we explore a simple hybrid back-off strategy for improving robustness, arguing that better evaluation is a prerequisite for building more robust probes.
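To make the probe-input choices named in the abstract concrete, here is a minimal sketch in plain Python, not the authors' implementation. It contrasts a single-token feature (the last response token) with a feature aggregated across response tokens (mean pooling is assumed here as one possible aggregation), both taken from one chosen layer, and fits a small logistic-regression probe on top. All function names, the toy data, the choice of mean pooling, and the label convention (1 = response judged correct) are assumptions for illustration.

```python
import math


def select_layer(hidden_states, layer_index):
    """hidden_states[layer][token] is a vector; return one layer's token vectors."""
    return hidden_states[layer_index]


def last_token_feature(token_vectors):
    """Single-token feature: the vector of the final response token."""
    return list(token_vectors[-1])


def mean_pool_feature(token_vectors):
    """Aggregated feature: element-wise mean over all response tokens (assumed choice)."""
    n = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in token_vectors) / n for d in range(dim)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def probe_score(w, b, x):
    """Probe output in (0, 1); here interpreted as estimated P(response is correct)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def train_linear_probe(features, labels, lr=0.5, epochs=500):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    dim = len(features[0])
    n = len(features)
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = probe_score(w, b, x) - y
            for d in range(dim):
                grad_w[d] += err * x[d]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


# Toy hidden states for one response: 3 layers, 2 response tokens, dimension 2.
toy_hidden_states = [
    [[0.0, 0.0], [0.0, 0.0]],   # first layer
    [[1.0, 2.0], [3.0, 4.0]],   # middle layer
    [[9.0, 9.0], [9.0, 9.0]],   # final layer
]
middle = select_layer(toy_hidden_states, 1)
pooled = mean_pool_feature(middle)       # aggregated over response tokens
single = last_token_feature(middle)      # single-token alternative

# Toy training set of already-extracted features (hypothetical values).
toy_features = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0], [-0.8, -0.2]]
toy_labels = [1, 1, 0, 0]
w, b = train_linear_probe(toy_features, toy_labels)
```

In a setup like this, the layer index and the feature function are the only things that change between probe variants, which is what makes it cheap to compare them both in-distribution and on a held-out OOD set.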

Eval Frameworks & Benchmarks, Natural Language Processing, Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
