Microsoft ResearchVanderbiltMar 12, 2026arXiv:2603.11394

Stop Listening to Me! How Multi-turn Conversations Can Degrade Diagnostic Reasoning

Kevin Guo, Chao Yan, Avinash Baidya, Katherine E Brown, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, Bradley Malin

AI Summary

This paper investigates the impact of multi-turn conversations on the diagnostic reasoning capabilities of 17 LLMs across three clinical datasets. They introduce a "stick-or-switch" evaluation framework to measure model conviction and flexibility in conversations, revealing a "conversation tax" where multi-turn interactions degrade performance compared to single-shot baselines. The study finds that models often abandon correct initial diagnoses in favor of incorrect user suggestions, highlighting a vulnerability to conversational influence.

Key Contribution

LLMs exhibit a surprising "conversation tax" in diagnostic reasoning, frequently abandoning correct initial diagnoses to align with incorrect user suggestions in multi-turn dialogues.

Abstract

Patients and clinicians are increasingly using chatbots powered by large language models (LLMs) for healthcare inquiries. While state-of-the-art LLMs exhibit high performance on static diagnostic reasoning benchmarks, their efficacy across multi-turn conversations, which better reflect real-world usage, has been understudied. In this paper, we evaluate 17 LLMs across three clinical datasets to investigate how partitioning the decision-space into multiple simpler turns of conversation influences their diagnostic reasoning. Specifically, we develop a"stick-or-switch"evaluation framework to measure model conviction (i.e., defending a correct diagnosis or safe abstention against incorrect suggestions) and flexibility (i.e., recognizing a correct suggestion when it is introduced) across conversations. Our experiments reveal the conversation tax, where multi-turn interactions consistently degrade performance when compared to single-shot baselines. Notably, models frequently abandon initial correct diagnoses and safe abstentions to align with incorrect user suggestions. Additionally, several models exhibit blind switching, failing to distinguish between signal and incorrect suggestions.

Eval Frameworks & Benchmarks Natural Language Processing Reasoning & Chain-of-Thought

Citation Metrics

Citations0

Influential citations0

References41

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Stop Listening to Me! How Multi-turn Conversations Can Degrade Diagnostic Reasoning

Related Papers