Mar 18, 2026 | arXiv:2603.17839

How do LLMs Compute Verbal Confidence

Dharshan Kumaran, Arthur Conmy, Federico Barbero, Simon Osindero, Viorica Patraucean, Petar Veličković

AI Summary

This paper investigates how LLMs compute verbal confidence scores, examining whether they are computed just-in-time or cached, and whether they represent token probabilities or a richer evaluation of answer quality. Through activation steering, patching, noising, and attention blocking experiments on Gemma 3 27B and Qwen 2.5 7B, the authors find evidence that confidence representations are cached at answer-adjacent positions and retrieved for output. Furthermore, linear probing reveals that these cached representations capture more variance in verbal confidence than token log-probabilities alone, indicating a more sophisticated self-evaluation process.

Key Contribution

LLMs don't just regurgitate token probabilities when expressing confidence; they perform a more sophisticated, cached self-evaluation of answer quality.

Abstract

Verbal confidence -- prompting LLMs to state their confidence as a number or category -- is widely used to extract uncertainty estimates from black-box models. However, how LLMs internally generate such scores remains unknown. We address two questions: first, when confidence is computed -- just-in-time when requested, or automatically during answer generation and cached for later retrieval; and second, what verbal confidence represents -- token log-probabilities, or a richer evaluation of answer quality? Focusing on Gemma 3 27B and Qwen 2.5 7B, we provide convergent evidence for cached retrieval. Activation steering, patching, noising, and swap experiments reveal that confidence representations emerge at answer-adjacent positions before appearing at the verbalization site. Attention blocking pinpoints the information flow: confidence is gathered from answer tokens, cached at the first post-answer position, then retrieved for output. Critically, linear probing and variance partitioning reveal that these cached representations explain substantial variance in verbal confidence beyond token log-probabilities, suggesting a richer answer-quality evaluation rather than a simple fluency readout. These findings demonstrate that verbal confidence reflects automatic, sophisticated self-evaluation -- not post-hoc reconstruction -- with implications for understanding metacognition in LLMs and improving calibration.
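To make the variance-partitioning idea in the abstract concrete, here is a minimal sketch on synthetic data. It is not the authors' code or data: the helper names (`r_squared`, `partition_variance`), the scalar `probe_pred` standing in for a linear probe's readout of cached activations, and the latent `quality` signal are all assumptions made for illustration. The sketch splits the variance in a confidence score into a part unique to token log-probabilities, a part unique to the probe readout, and a shared part, using nested least-squares fits.

```python
import numpy as np


def r_squared(X, y):
    """In-sample R^2 of an ordinary least-squares fit of y on X (intercept added)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()


def partition_variance(logprob, probe_pred, confidence):
    """Split explained variance into unique and shared parts (hypothetical helper)."""
    r2_logprob = r_squared(logprob[:, None], confidence)
    r2_probe = r_squared(probe_pred[:, None], confidence)
    r2_both = r_squared(np.column_stack([logprob, probe_pred]), confidence)
    return {
        "unique_logprob": r2_both - r2_probe,  # gained by adding log-probs to the probe
        "unique_probe": r2_both - r2_logprob,  # gained by adding the probe to log-probs
        "shared": r2_logprob + r2_probe - r2_both,
        "total": r2_both,
    }


# Synthetic data only: every quantity below is an illustrative assumption.
rng = np.random.default_rng(0)
n = 2000
logprob = rng.normal(size=n)            # stand-in for answer-token log-probability
quality = rng.normal(size=n)            # hypothetical latent answer-quality signal
probe_pred = 0.5 * logprob + quality    # stand-in for a linear probe's prediction
confidence = 0.4 * logprob + 0.8 * quality + 0.3 * rng.normal(size=n)

parts = partition_variance(logprob, probe_pred, confidence)
for name, value in parts.items():
    print(f"{name}: {value:.3f}")
```

In this toy setup the probe readout carries information that log-probabilities lack, so `unique_probe` comes out large while `unique_logprob` is near zero. A real analysis would fit the probe on held-out hidden activations and use cross-validated R^2 rather than the in-sample fit used here.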

Eval Frameworks & Benchmarks, Interpretability & Mechanistic Interp, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 43

Year: 2026

Venue: N/A
