This paper investigates physicians' perceptions of LLM capabilities in clinical reasoning to understand trust calibration in AI-assisted diagnosis. The study presented clinical cases to physicians (N=37), collected their evaluations of LLM-generated analyses, and compared these perceptions with benchmark performance. The results reveal discrepancies between benchmark scores and physician-perceived value, highlighting the limitations of current evaluation metrics and informing strategies for building trustworthy LLM-physician collaboration.
Physicians' trust in LLMs for diagnosis hinges on reasoning aspects not captured by standard benchmarks, revealing a critical gap in current evaluation practices.
Large language models (LLMs) have shown considerable potential in supporting medical diagnosis. However, their effective integration into clinical workflows is hindered by physicians' difficulties in perceiving and trusting LLM capabilities, which often results in miscalibrated trust. Existing model evaluations primarily emphasize standardized benchmarks and predefined tasks, offering limited insight into clinical reasoning practices. Moreover, research on human-AI collaboration has rarely examined physicians' perceptions of LLMs' clinical reasoning capability. In this work, we investigate how physicians perceive LLMs' capabilities in the clinical reasoning process. We designed clinical cases, collected the corresponding LLM-generated analyses, and obtained evaluations from physicians (N=37) to quantitatively represent their perceived LLM diagnostic capabilities. By comparing these perceived evaluations with benchmark performance, our study highlights the aspects of clinical reasoning that physicians value and underscores the limitations of benchmark-based evaluation. We further discuss the implications and opportunities for enhancing trustworthy collaboration between physicians and LLMs in LLM-supported clinical reasoning.