Google Research · Chongqing Ant Consumer Finance Co. · Columbia · Cornell · Keiji AI · NIH · NTU · NYU · Pitt · PKU · UIUC · UT Austin · UT Southwestern Medical Center

Sep 24, 2025

A foundation model for human-AI collaboration in medical literature mining

Zifeng Wang, Lang Cao, Qiao Jin, Joey Chan, Nicholas Wan, Behdad Afzali, Hyun-Jin Cho, C. Choi, Mehdi Emamverdi, Manjot K Gill, Sun-Hyung Kim, Yijia Li, Yi Liu, Yiming Luo, Hanley Ong, Justin F. Rousseau, Irfan Sheikh, Jenny J. Wei, Ziyang Xu, Christopher M. Zallek, Kyungsang Kim, Yifan Peng, Zhiyong Lu, Jimeng Sun

AI Summary

The authors introduce LEADS, a domain-specific foundation model for medical literature mining, trained on a large, curated dataset of systematic reviews, clinical trials, and registries. LEADS outperforms four cutting-edge LLMs on six literature mining tasks, including study search, screening, and data extraction. A user study with clinicians and researchers showed that LEADS improves recall and accuracy while saving time in study selection and data extraction tasks, demonstrating its potential to enhance expert productivity.

Key Contribution

Clinicians using a new medical literature mining LLM, LEADS, achieved 0.81 recall vs. 0.78 without it, while saving 20.8% of their time.

Abstract

Applying artificial intelligence (AI) to systematic literature review holds great potential for enhancing evidence-based medicine, yet has been limited by insufficient training and evaluation. Here, we present LEADS, an AI foundation model trained on 633,759 samples curated from 21,335 systematic reviews, 453,625 clinical trial publications, and 27,015 clinical trial registries. In experiments, LEADS demonstrates consistent improvements over four cutting-edge large language models (LLMs) on six literature mining tasks, e.g., study search, screening, and data extraction. We conduct a user study with 16 clinicians and researchers from 14 institutions to assess the utility of LEADS integrated into the expert workflow. In study selection, experts using LEADS achieve 0.81 recall vs. 0.78 without, saving 20.8% of their time. In data extraction, they reach 0.85 accuracy vs. 0.80 without, saving 26.9% of their time. These findings encourage future work on leveraging high-quality domain data to build specialized LLMs that outperform generic models and enhance expert productivity in literature mining.

Literature mining, such as systematic review and meta-analysis, is crucial for discovering, integrating, and interpreting emerging research. This study presents a specialized large language model for literature mining that outperforms six general LLMs and helps clinicians in study selection and data extraction tasks.
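For readers unfamiliar with the metrics reported above, the following is a minimal sketch of how recall in study selection and relative time savings are conventionally computed. The function names and the toy inputs are hypothetical and are not taken from the paper, whose exact evaluation protocol may differ.

```python
def recall(selected: set[str], relevant: set[str]) -> float:
    """Fraction of the truly relevant studies that were selected."""
    if not relevant:
        raise ValueError("relevant set must be non-empty")
    return len(selected & relevant) / len(relevant)


def time_saving(baseline_minutes: float, assisted_minutes: float) -> float:
    """Relative reduction in time versus the unassisted baseline."""
    if baseline_minutes <= 0:
        raise ValueError("baseline_minutes must be positive")
    return (baseline_minutes - assisted_minutes) / baseline_minutes


# Toy, made-up inputs for illustration only (not data from the study).
relevant = {"s1", "s2", "s3", "s4"}
selected = {"s1", "s2", "s3", "s9"}
print(recall(selected, relevant))   # 3 of the 4 relevant studies were found
print(time_saving(100.0, 80.0))     # 20 of 100 baseline minutes were saved
```

Under these definitions, the reported figures correspond to an absolute recall gain of 0.03 in study selection and an absolute accuracy gain of 0.05 in data extraction.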

Eval Frameworks & Benchmarks · Natural Language Processing · Scientific Discovery & Drug Design

Citation Metrics

Citations: 3

Influential citations: 0

References: 71

Year: 2025

Venue: Nature Communications
