Apr 30, 2026arXiv:2604.27470

HealthBench Professional: Evaluating Large Language Models on Real Clinician Chats

Rebecca Soskin Hicks, M. Trofimov, Mikhail Trofimov, Dominick Lim, D. Lim, Rahul K. Arora, Foivos Tsimpourlas, Preston Bowman, Michael Sharman, Chi Tong, Chio-Kin Tong, Kavin Karthik, K. Karthik, Arnav Dugar, Akshay Jagadeesh, Khaled Saab, Johannes Heidecke, Ashley Alexander, Nate Gross, Karan Singhal

AI Summary

HealthBench Professional, a new benchmark, evaluates LLMs on real-world clinical tasks via physician-authored conversations with ChatGPT across care consult, documentation, and research use cases. The benchmark includes adversarial examples and is scored using rubrics adjudicated by multiple physicians. Surprisingly, GPT-5.4 in ChatGPT for Clinicians outperforms both base GPT-5.4 and human physicians on this benchmark.

Key Contribution

ChatGPT for Clinicians, not human doctors, currently achieves the highest scores on a new benchmark of real-world clinical LLM tasks.

Abstract

Millions of clinicians use ChatGPT to support clinical care, but evaluations of the most common use cases in model-clinician conversations are limited. We introduce HealthBench Professional, an open benchmark for evaluating large language models on real tasks that clinicians bring to ChatGPT in the course of their work. The benchmark is organized around three common use cases central to clinical practice: care consult, writing and documentation, and medical research. Each example includes a physician-authored conversation with ChatGPT for Clinicians and is scored via rubrics written and iteratively adjudicated by three or more physicians across three phases. HealthBench Professional examples were carefully selected for quality, representativeness, and difficulty for OpenAI's current frontier models, to enable continued measurement of progress. Difficult examples for recent OpenAI models were enriched by roughly 3.5 times relative to the candidate pool of 15,079 examples. Additionally, about one-third of examples involve physicians conducting deliberate adversarial testing of models. As a strong baseline, we also collected human physician responses for all tasks (unbounded time, specialist-matched, web access). The best scoring system, GPT-5.4 in ChatGPT for Clinicians, outperforms base GPT-5.4, all other models, and human physicians. We hope HealthBench Professional provides the healthcare AI community a measure to track frontier model progress in real-world clinical tasks and build systems that clinicians can trust to improve care.

Eval Frameworks & Benchmarks Natural Language Processing Scientific Discovery & Drug Design

Citation Metrics

Citations0

Influential citations0

References37

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

HealthBench Professional: Evaluating Large Language Models on Real Clinician Chats

Related Papers