This paper investigates how LLMs compute verbal confidence scores, examining whether they are computed just-in-time or cached, and whether they represent token probabilities or a richer evaluation of answer quality. Through activation steering, patching, noising, and attention blocking experiments on Gemma 3 27B and Qwen 2.5 7B, the authors find evidence that confidence representations are cached at answer-adjacent positions and retrieved for output. Furthermore, linear probing reveals that these cached representations capture more variance in verbal confidence than token log-probabilities alone, indicating a more sophisticated self-evaluation process.
LLMs don't just regurgitate token probabilities when expressing confidence; they perform a more sophisticated, cached self-evaluation of answer quality.
Verbal confidence -- prompting LLMs to state their confidence as a number or category -- is widely used to extract uncertainty estimates from black-box models. However, how LLMs internally generate such scores remains unknown. We address two questions: first, when confidence is computed (just-in-time when requested, or automatically during answer generation and cached for later retrieval); and second, what verbal confidence represents (token log-probabilities, or a richer evaluation of answer quality). Focusing on Gemma 3 27B and Qwen 2.5 7B, we provide convergent evidence for cached retrieval. Activation steering, patching, noising, and swap experiments reveal that confidence representations emerge at answer-adjacent positions before appearing at the verbalization site. Attention blocking pinpoints the information flow: confidence is gathered from answer tokens, cached at the first post-answer position, then retrieved for output. Critically, linear probing and variance partitioning reveal that these cached representations explain substantial variance in verbal confidence beyond token log-probabilities, suggesting a richer answer-quality evaluation rather than a simple fluency readout. These findings demonstrate that verbal confidence reflects automatic, sophisticated self-evaluation -- not post-hoc reconstruction -- with implications for understanding metacognition in LLMs and improving calibration.
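The variance-partitioning claim in the abstract can be illustrated with a small sketch. This is not the authors' pipeline: the data are synthetic, and the names `r_squared`, `partition_variance`, and the latent `quality` variable are hypothetical and introduced only for illustration. It uses in-sample R² from ordinary least squares, where a real analysis would more likely use cross-validated fits on actual activations. The point is only to show what "variance explained beyond token log-probabilities" means: the gain in R² when probe features are added to a model that already contains log-probabilities.

```python
import numpy as np

def r_squared(X, y):
    """In-sample R^2 of an ordinary least-squares fit with an intercept."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - resid.var() / y.var()

def partition_variance(logprob, probe, y):
    """Split the explained variance in y into unique and shared parts."""
    r2_lp = r_squared(logprob, y)
    r2_pr = r_squared(probe, y)
    r2_full = r_squared(np.column_stack([logprob, probe]), y)
    return {
        "unique_probe": r2_full - r2_lp,    # explained only by the probe
        "unique_logprob": r2_full - r2_pr,  # explained only by log-probs
        "shared": r2_lp + r2_pr - r2_full,  # explained by either predictor
        "total": r2_full,
    }

# Synthetic stand-ins (assumptions, not data from the paper).
rng = np.random.default_rng(0)
n = 2000
logprob = rng.normal(size=(n, 1))   # answer-token log-probability
quality = rng.normal(size=(n, 1))   # hypothetical latent answer quality
# The probe readout mixes log-prob information with the latent quality signal.
probe = 0.5 * logprob + quality + 0.1 * rng.normal(size=(n, 1))
# Verbal confidence depends on both, plus noise.
confidence = (0.4 * logprob + 0.8 * quality).ravel() + 0.3 * rng.normal(size=n)

parts = partition_variance(logprob, probe, confidence)
for name, value in parts.items():
    print(f"{name}: {value:.3f}")
```

In this toy setup a large `unique_probe` share corresponds to the abstract's conclusion: the cached representation carries information about verbal confidence that log-probabilities alone do not.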