Search papers, labs, and topics across Lattice.
This paper investigates the impact of multi-turn conversations on the diagnostic reasoning capabilities of 17 LLMs across three clinical datasets. They introduce a "stick-or-switch" evaluation framework to measure model conviction and flexibility in conversations, revealing a "conversation tax" where multi-turn interactions degrade performance compared to single-shot baselines. The study finds that models often abandon correct initial diagnoses in favor of incorrect user suggestions, highlighting a vulnerability to conversational influence.
LLMs exhibit a surprising "conversation tax" in diagnostic reasoning, frequently abandoning correct initial diagnoses to align with incorrect user suggestions in multi-turn dialogues.
Patients and clinicians are increasingly using chatbots powered by large language models (LLMs) for healthcare inquiries. While state-of-the-art LLMs exhibit high performance on static diagnostic reasoning benchmarks, their efficacy across multi-turn conversations, which better reflect real-world usage, has been understudied. In this paper, we evaluate 17 LLMs across three clinical datasets to investigate how partitioning the decision-space into multiple simpler turns of conversation influences their diagnostic reasoning. Specifically, we develop a"stick-or-switch"evaluation framework to measure model conviction (i.e., defending a correct diagnosis or safe abstention against incorrect suggestions) and flexibility (i.e., recognizing a correct suggestion when it is introduced) across conversations. Our experiments reveal the conversation tax, where multi-turn interactions consistently degrade performance when compared to single-shot baselines. Notably, models frequently abandon initial correct diagnoses and safe abstentions to align with incorrect user suggestions. Additionally, several models exhibit blind switching, failing to distinguish between signal and incorrect suggestions.