NJU, UT Austin | Apr 6, 2026 | arXiv:2604.04325

Benchmarking Multi-turn Medical Diagnosis: Hold, Lure, and Self-Correction

Jinrui Fang, Runhan Chen, Xu Yang, Jian Yu, Jiawei Xu, Ashwin Vinod, Wenqi Shi, Tianlong Chen, Heng Ji, Ying Ding, Yuji Zhang

AI Summary

The paper introduces MINT, a new benchmark for multi-turn medical diagnosis comprising 1,035 cases with clinically labeled evidence shards, to evaluate LLMs' diagnostic reasoning under incremental evidence accumulation. Through systematic evaluation of 11 LLMs, the study identifies three key behavioral patterns: premature answering, self-correction, and susceptibility to strong lures, which significantly impact diagnostic accuracy. The authors demonstrate that deferring the diagnostic question and reserving salient clinical evidence for later turns can substantially improve accuracy, offering actionable guidance for enhancing LLM reliability in medical diagnosis.

Key Contribution

LLMs performing multi-turn medical diagnosis tend to commit to an answer before seeing all the evidence. Deferring the diagnostic question to later turns improves accuracy at the first point of commitment by up to 62.6%, and reserving salient clinical evidence for later turns avoids an accuracy drop of up to 23.3%.

Abstract

Large language models (LLMs) achieve high accuracy in medical diagnosis when all clinical information is provided in a single turn, yet how they behave under multi-turn evidence accumulation, which is closer to real clinical reasoning, remains unexplored. We introduce MINT (Medical Incremental N-Turn Benchmark), a high-fidelity, multi-turn medical diagnosis benchmark comprising 1,035 cases with clinically labeled evidence shards, controlled turn granularity, and information-preserving decomposition. Through systematic evaluation of 11 LLMs on MINT, we uncover three persistent behavioral patterns that significantly impact diagnostic decisions: (1) intent to answer: models rush to answer before sufficient evidence has been observed, with over 55% of answers committed within the first two turns; (2) self-correction: incorrect-to-correct answer revisions occur at up to 10.6 times the rate of correct-to-incorrect flips, revealing a latent capacity for self-correction that premature commitment forecloses; and (3) strong lures: clinically salient information such as laboratory results triggers premature answering even when models are explicitly instructed to wait. We translate these findings into clinically actionable guidance: deferring the diagnostic question to later turns reduces premature answering and improves accuracy at the first point of commitment by up to 62.6%, while reserving salient clinical evidence for later turns prevents a catastrophic accuracy drop of up to 23.3% caused by premature commitment. Our work provides both a controlled evaluation framework and concrete recommendations for improving the reliability of LLMs in multi-turn medical diagnosis.
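The two headline statistics in the abstract (the share of answers committed within the first two turns, and the ratio of incorrect-to-correct revisions to correct-to-incorrect flips) can be illustrated with a minimal Python sketch. This is not the authors' evaluation code. The trace format (one answer or `None` per turn), the function names, and the toy data are all hypothetical, and the flip ratio here uses raw counts, which may differ from how the paper normalizes its rates.

```python
from typing import Optional, Sequence, Tuple

# Assumed (hypothetical) trace format: one entry per turn, None if the
# model withheld an answer at that turn, otherwise the diagnosis it gave.
Trace = Sequence[Optional[str]]


def first_commit_turn(answers: Trace) -> Optional[int]:
    """Return the 1-based turn of the first committed answer, or None."""
    for turn, answer in enumerate(answers, start=1):
        if answer is not None:
            return turn
    return None


def early_commit_share(traces: Sequence[Trace], max_turn: int = 2) -> float:
    """Fraction of committed traces whose first answer arrives by max_turn."""
    turns = [first_commit_turn(t) for t in traces]
    committed = [t for t in turns if t is not None]
    if not committed:
        return 0.0
    return sum(t <= max_turn for t in committed) / len(committed)


def count_flips(answers: Trace, gold: str) -> Tuple[int, int]:
    """Count (incorrect->correct, correct->incorrect) revisions in one trace.

    Only consecutive committed answers are compared; withheld turns are skipped.
    """
    given = [a for a in answers if a is not None]
    fixes = breaks = 0
    for prev, cur in zip(given, given[1:]):
        if prev != gold and cur == gold:
            fixes += 1
        elif prev == gold and cur != gold:
            breaks += 1
    return fixes, breaks


# Toy data (made up for illustration): (per-turn answers, gold diagnosis).
cases = [
    (["A", "A", "B"], "B"),     # commits at turn 1, later self-corrects
    ([None, None, "C"], "C"),   # waits until turn 3
    ([None, "D", "E"], "D"),    # commits at turn 2, then flips to a wrong answer
    (["F", "G", "G"], "G"),     # commits at turn 1, later self-corrects
]

share = early_commit_share([answers for answers, _ in cases])
total_fixes = sum(count_flips(a, g)[0] for a, g in cases)
total_breaks = sum(count_flips(a, g)[1] for a, g in cases)
```

On the toy data, three of the four traces commit within the first two turns, and there are two incorrect-to-correct revisions against one correct-to-incorrect flip. Comparing only consecutive committed answers keeps withheld turns from being counted as revisions.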

Eval Frameworks & Benchmarks | Natural Language Processing | Scientific Discovery & Drug Design

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
