Jan 8, 2026

arXiv:2601.06193

MLB: A Scenario-Driven Benchmark for Evaluating Large Language Models in Clinical Applications

Qing He1,*, Dongsheng Bi1,*, Jianrong Lu2,1,*, Minghui Yang1, Zixiao Chen1, Jiacheng Lu1, Jing Chen1, Nannan Du1, Xiao Cu1, Sijing Wu3, Peng Xiang4, Yinyin Hu3, Yi Guo3, Chunpu Li3, Shaoyang Li1, Zhuo Dong1, Ming Jiang1, Shuai Guo1, Liyun Feng1, Jin Peng1, Jian Wang1, Jinjie Gu1, Junwei Liu1,5,†

AI Summary

The authors introduce MLB, a new benchmark designed to evaluate LLMs on scenario-based reasoning across five dimensions relevant to clinical applications, including medical knowledge, safety/ethics, and medical record understanding. MLB incorporates 22 datasets (17 newly curated from Chinese clinical sources) and a scalable evaluation methodology using a specialized judge model fine-tuned on expert annotations. Experiments on 10 leading models reveal a translational gap, with performance varying significantly across different dimensions, highlighting the importance of targeted training for specific clinical applications.

Key Contribution

LLMs that ace medical exams still flunk real-world clinical scenarios, revealing a critical gap between static knowledge and practical application that this new benchmark, MLB, exposes.

Abstract

The proliferation of Large Language Models (LLMs) presents transformative potential for healthcare, yet practical deployment is hindered by the absence of frameworks that assess real-world clinical utility. Existing benchmarks test static knowledge and fail to capture the dynamic, application-oriented capabilities required in clinical practice. To bridge this gap, we introduce the Medical LLM Benchmark (MLB), a comprehensive benchmark that evaluates LLMs on both foundational knowledge and scenario-based reasoning. MLB is structured around five core dimensions: Medical Knowledge (MedKQA), Safety and Ethics (MedSE), Medical Record Understanding (MedRU), Smart Services (SmartServ), and Smart Healthcare (SmartCare). The benchmark integrates 22 datasets (17 newly curated) from diverse Chinese clinical sources, covering 64 clinical specialties. Its design features a rigorous curation pipeline involving 300 licensed physicians. We also provide a scalable evaluation methodology centered on a specialized judge model trained via Supervised Fine-Tuning (SFT) on expert annotations. Our comprehensive evaluation of 10 leading models reveals a critical translational gap: while the top-ranked model, Kimi-K2-Instruct (77.3% accuracy overall), excels in structured tasks like information extraction (87.8% accuracy in MedRU), its performance plummets in patient-facing scenarios (61.3% in SmartServ). Moreover, the exceptional safety score (90.6% in MedSE) of the much smaller Baichuan-M2-32B highlights that targeted training is equally critical. Our specialized judge model, trained via SFT on a 19k expert-annotated medical dataset, achieves 92.1% accuracy, an F1-score of 94.37%, and a Cohen's Kappa of 81.3% for human-AI consistency, validating a reproducible and expert-aligned evaluation protocol. MLB thus provides a rigorous framework to guide the development of clinically viable LLMs.
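The abstract reports three agreement figures between the judge model and human experts: accuracy, F1-score, and Cohen's Kappa. As a minimal sketch of how such figures are conventionally computed from paired verdicts, the Python below uses the textbook definitions. The function name `agreement_metrics`, the binary 1/0 verdict encoding, and the toy label lists are all hypothetical illustrations; they are not the authors' evaluation code or data.

```python
from collections import Counter


def agreement_metrics(human, judge, positive=1):
    """Compare judge verdicts against human verdicts (hypothetical helper).

    Returns accuracy, F1 for the `positive` class, and Cohen's kappa.
    """
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and of equal length")

    n = len(human)
    pairs = list(zip(human, judge))

    # Confusion counts, treating the human verdict as the reference.
    tp = sum(1 for h, j in pairs if h == positive and j == positive)
    fp = sum(1 for h, j in pairs if h != positive and j == positive)
    fn = sum(1 for h, j in pairs if h == positive and j != positive)

    # Observed agreement is the same quantity as accuracy.
    accuracy = sum(1 for h, j in pairs if h == j) / n

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)

    # Cohen's kappa corrects observed agreement for chance agreement p_e,
    # estimated from each rater's marginal label frequencies.
    human_counts = Counter(human)
    judge_counts = Counter(judge)
    labels = set(human) | set(judge)
    p_e = sum(human_counts[c] * judge_counts[c] for c in labels) / (n * n)
    kappa = (accuracy - p_e) / (1 - p_e) if p_e != 1 else 1.0

    return {"accuracy": accuracy, "f1": f1, "kappa": kappa}


# Toy verdicts (1 = response judged correct, 0 = judged incorrect).
# These values are made up for illustration only.
human_labels = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
judge_labels = [1, 1, 0, 0, 0, 1, 1, 1, 1, 0]
metrics = agreement_metrics(human_labels, judge_labels)
print(metrics)
```

Kappa is typically lower than raw accuracy because it discounts the agreement two raters would reach by chance given their label frequencies, which is consistent with the abstract reporting a Kappa below its accuracy figure.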

Citation Metrics

Citations: 0

Influential citations: 0

References: 35

Year: 2026

Venue: N/A
