FudanHKUNJUPolyUShanghai AI LabApr 20, 2026arXiv:2604.18418

MedProbeBench: Systematic Benchmarking at Deep Evidence Integration for Expert-level Medical Guideline

Jiyao Liu, Jianghan Shen, Sida Song, Tianbin Li, Xiaojia Liu, Rongbin Li, Ziyan Huang, Jiashi Lin, Junzhi Ning, Changkai Ji, Siqi Luo, Wenjie Li, Chenglong Ma, Jing Xiong, Jin Ye, Bin Fu, Ningsheng Xu, Yirong Chen, Lei Jin, Hong Chen, Junjun He

AI Summary

MedProbeBench, a new benchmark, is introduced to evaluate deep research systems' ability to integrate evidence and generate clinical guidelines, using high-quality clinical guidelines as expert-level references. The benchmark is paired with MedProbe-Eval, a comprehensive evaluation framework featuring holistic rubrics and fine-grained evidence verification. Evaluation of 17 LLMs and deep research agents reveals significant shortcomings in evidence integration and guideline generation compared to expert-level performance.

Key Contribution

LLMs are still far from being able to generate expert-level clinical guidelines, despite advances in deep research systems.

Abstract

Recent advances in deep research systems enable large language models to retrieve, synthesize, and reason over large-scale external knowledge. In medicine, developing clinical guidelines critically depends on such deep evidence integration. However, existing benchmarks fail to evaluate this capability in realistic workflows requiring multi-step evidence integration and expert-level judgment. To address this gap, we introduce MedProbeBench, the first benchmark leveraging high-quality clinical guidelines as expert-level references. Medical guidelines, with their rigorous standards in neutrality and verifiability, represent the pinnacle of medical expertise and pose substantial challenges for deep research agents. For evaluation, we propose MedProbe-Eval, a comprehensive evaluation framework featuring: (1) Holistic Rubrics with 1,200+ task-adaptive rubric criteria for comprehensive quality assessment, and (2) Fine-grained Evidence Verification for rigorous validation of evidence precision, grounded in 5,130+ atomic claims. Evaluation of 17 LLMs and deep research agents reveals critical gaps in evidence integration and guideline generation, underscoring the substantial distance between current capabilities and expert-level clinical guideline development. Project: https://github.com/uni-medical/MedProbeBench

Eval Frameworks & Benchmarks Reasoning & Chain-of-Thought Recommendation & Information Retrieval

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

MedProbeBench: Systematic Benchmarking at Deep Evidence Integration for Expert-level Medical Guideline

Related Papers