Michigan State, QMUL, XJTU
Apr 7, 2026
arXiv:2604.05397

Confidence Should Be Calibrated More Than One Turn Deep

Zhaohan Zhang, Chengzhengxu Li, Xiaoming Liu, Ziquan Liu, Ioannis Patras

AI Summary

This paper introduces the task of multi-turn calibration for LLMs and shows that user feedback can degrade calibration over successive turns. The authors propose MTCal, a method that minimises Expected Calibration Error at turn T (ECE@T) using a surrogate calibration target, and ConfChat, a decoding strategy that uses the calibrated confidence. Experiments indicate that MTCal improves multi-turn calibration, while ConfChat improves factuality and consistency in multi-turn interactions.

Key Contribution

User feedback such as persuasion can degrade an LLM's confidence calibration across conversation turns; MTCal is proposed to counter this by minimising ECE@T.

Abstract

Large Language Models (LLMs) are increasingly applied in high-stakes domains such as finance, healthcare, and education, where reliable multi-turn interactions with users are essential. However, existing work on confidence estimation and calibration, a major approach to building trustworthy LLM systems, largely focuses on single-turn settings and overlooks the risks and potential of multi-turn conversations. In this work, we introduce the task of multi-turn calibration to reframe calibration from a static property into a dynamic challenge central to reliable multi-turn conversation, where calibrating model confidence at each turn conditioned on the conversation history is required. We first reveal the risks of this setting: using Expected Calibration Error at turn T (ECE@T), a new metric that tracks calibration dynamics over turns, we show that user feedback (e.g., persuasion) can degrade multi-turn calibration. To address this, we propose MTCal, which minimises ECE@T via a surrogate calibration target, and further leverage calibrated confidence in ConfChat, a decoding strategy that improves both factuality and consistency of the model response in multi-turn interactions. Extensive experiments demonstrate that MTCal achieves outstanding and consistent performance in multi-turn calibration, and ConfChat preserves and even enhances model performance in multi-turn interactions. Our results mark multi-turn calibration as one missing link for scaling LLM calibration toward safe, reliable, and real-world use.
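To make the idea of a per-turn calibration metric concrete, here is a minimal Python sketch. The abstract does not give the exact definition of ECE@T, so this assumes it is the standard equal-width binned ECE computed from the confidences and correctness labels recorded at turn T across a set of conversations. The function names `ece` and `ece_at_turn`, the array layout, and the bin count are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


def ece(confidences, correct, n_bins=10):
    """Standard binned Expected Calibration Error over a set of predictions."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Map each confidence in [0, 1] to one of n_bins equal-width bins;
    # a confidence of exactly 1.0 falls into the last bin.
    bin_ids = np.minimum((confidences * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            # Gap between empirical accuracy and mean confidence in this bin,
            # weighted by the fraction of samples that landed in the bin.
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            total += mask.mean() * gap
    return float(total)


def ece_at_turn(conf_by_turn, correct_by_turn, turn, n_bins=10):
    """ECE restricted to a single 1-indexed turn (assumed reading of ECE@T).

    conf_by_turn, correct_by_turn: arrays of shape (n_conversations, n_turns).
    """
    conf = np.asarray(conf_by_turn, dtype=float)[:, turn - 1]
    corr = np.asarray(correct_by_turn, dtype=float)[:, turn - 1]
    return ece(conf, corr, n_bins)


# Toy data (made up for illustration): 4 conversations, 2 turns each.
# At turn 2 every confidence is 0.9 but only half the answers are correct.
toy_conf = [[0.9, 0.9], [0.9, 0.9], [0.1, 0.9], [0.1, 0.9]]
toy_correct = [[1, 1], [1, 0], [0, 0], [0, 1]]
for t in (1, 2):
    print(f"ECE@{t} = {ece_at_turn(toy_conf, toy_correct, t):.3f}")
```

On this toy data the value rises from turn 1 to turn 2, which is the kind of degradation across turns that a per-turn metric is meant to surface.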

Eval Frameworks & Benchmarks, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 48

Year: 2026

Venue: N/A
