This paper introduces PhysAssistBench, a benchmark for evaluating how well medical LLMs assist physicians in interactive doctor-patient-EHR scenarios built from real MIMIC-IV cases. Its experiments show that leading LLMs remain unreliable in this setting, where clinical knowledge, patient communication, and EHR system interaction must be coordinated within a single interaction. The authors identify this as a key bottleneck for clinical LLMs: reliable assistance requires integrated capabilities rather than isolated improvements in any one of them.
Current LLMs remain unreliable as physician assistants, exposing a gap in their ability to coordinate clinical knowledge, patient communication, and EHR interactions.
The most plausible near-term role of medical LLMs is to assist rather than replace physicians, yet current evaluations often test isolated capabilities: clinical knowledge, EHR system interaction, or patient communication. Physician assistance instead requires coordinating these capabilities within the same interaction, where physicians issue underspecified requests, patients describe symptoms ambiguously, and EHR systems demand precise tool use. We introduce PhysAssistBench, a benchmark for interactive doctor-patient-EHR assistance. Built from real MIMIC-IV cases, PhysAssistBench uses a scalable pipeline to construct agentic patients: interactive, record-grounded agents that turn static EHR records into multi-turn clinical scenarios while preserving clinical factuality. PhysAssistBench provides a curated bilingual evaluation set of 1,296 manually reviewed and physician-validated turns. Experiments with leading LLMs show that current models remain unreliable in this setting, which exposes a key bottleneck for clinical LLMs: reliable assistance requires coordination across knowledge, communication, and systems, not isolated gains in any of them.