Aalto, HIT, PolyU, Tencent AI, TU Munich
Equal supervision
Jun 17, 2026
arXiv:2606.18613

Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

Tianming Du, Peijie Yu, Sihan Shang, Danli Shi, My Linh Nguyen, Shengbo Gao, Guangyuan Li, Yinghong Yu, Yan Jiang, Qianlong Zhao, Behzad Bozorgtabar, Shaoxiong Ji, Jiazhen Pan, Daniel Rueckert, Jiancheng Yang

AI Summary

This paper introduces PhysAssistBench, a benchmark that evaluates how well medical LLMs provide interactive assistance in doctor-patient-EHR scenarios built from real MIMIC-IV cases. Experiments show that current leading LLMs remain unreliable in this setting, where a model must coordinate clinical knowledge, patient communication, and EHR system use within a single interaction. The findings point to a key bottleneck for clinical LLMs: reliable assistance requires these capabilities to work together, not isolated gains in any one of them.

Key Contribution

Current LLMs falter in delivering reliable medical assistance, exposing a critical gap in their ability to coordinate knowledge, communication, and EHR interactions.

Abstract

The most plausible near-term role of medical LLMs is to assist rather than replace physicians, yet current evaluations often test isolated capabilities: clinical knowledge, EHR system interaction, or patient communication. Physician assistance instead requires coordinating these capabilities within the same interaction, where physicians issue underspecified requests, patients describe symptoms ambiguously, and EHR systems demand precise tool use. We introduce PhysAssistBench, a benchmark for interactive doctor-patient-EHR assistance. Built from real MIMIC-IV cases, PhysAssistBench uses a scalable pipeline to construct agentic patients: interactive, record-grounded agents that turn static EHR records into multi-turn clinical scenarios while preserving clinical factuality. PhysAssistBench provides a curated bilingual evaluation set of 1,296 manually reviewed and physician-validated turns. Experiments with leading LLMs show that current models remain unreliable in this setting, which exposes a key bottleneck for clinical LLMs: reliable assistance requires coordination across knowledge, communication, and systems, not isolated gains in any of them.

Eval Frameworks & Benchmarks, Natural Language Processing, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
