The authors introduce RWE-LLM, a novel framework for real-world safety validation of LLMs in healthcare, emphasizing output testing through large-scale clinician engagement. They evaluated a non-diagnostic AI Care Agent across four iterations, engaging over 6,000 US licensed clinicians in a three-tier review process. The results demonstrate substantial safety improvements, with correct medical advice rates increasing from ~80.0% to 99.38% and severe harm concerns being eliminated, establishing a practical model for AI safety in healthcare.
Forget traditional LLM benchmarks: this study shows how a real-world, output-focused safety framework with thousands of clinicians can drive dramatic improvements in healthcare AI, slashing potential harm to near zero.
Background: The deployment of artificial intelligence (AI) in healthcare necessitates robust safety validation frameworks, particularly for systems that interact directly with patients. While theoretical frameworks exist, a critical gap remains between abstract principles and practical implementation. Traditional LLM benchmarking approaches provide very limited output coverage and are insufficient for healthcare applications that require high safety standards.

Objective: To develop and evaluate a comprehensive framework for healthcare AI safety validation through large-scale clinician engagement.

Methods: We implemented the RWE-LLM (Real-World Evaluation of Large Language Models in Healthcare) framework, drawing inspiration from red-teaming methodologies while expanding their scope to achieve comprehensive safety validation. Our approach emphasizes output testing rather than relying solely on input data quality, and proceeds through four stages: pre-implementation, tiered review, resolution, and continuous monitoring. We engaged 6,234 US licensed clinicians (5,969 nurses and 265 physicians) with an average of 11.5 years of clinical experience. The framework employed a three-tier review process for error detection and resolution, evaluating a non-diagnostic AI Care Agent focused on patient education, follow-ups, and administrative support across four iterations (pre-Polaris and Polaris 1.0, 2.0, and 3.0).

Results: Over 307,000 unique calls were evaluated using the RWE-LLM framework. Each interaction could be flagged for errors across multiple severity categories, from minor clinical inaccuracies to significant safety concerns. The multi-tiered review system processed all flagged interactions, with internal nursing reviews providing initial expert evaluation followed by physician adjudication when necessary. The framework maintained effective throughput in addressing identified safety concerns while keeping processing times and documentation standards consistent. A continuous feedback loop between error identification and system enhancement drove systematic improvements in safety protocols. Performance metrics showed substantial safety gains between iterations: correct medical advice rates improved from ~80.0% (pre-Polaris) to 96.79% (Polaris 1.0), 98.75% (Polaris 2.0), and 99.38% (Polaris 3.0); incorrect advice with potential for minor harm decreased from 1.32% to 0.13% and then 0.07%; and severe harm concerns were eliminated, falling from 0.06% and 0.10% in earlier iterations to 0.00%.

Conclusions: The successful nationwide implementation of the RWE-LLM framework establishes a practical model for ensuring AI safety in healthcare settings. Our methodology demonstrates that comprehensive output testing provides significantly stronger safety assurance than the traditional input validation approaches used by horizontal LLMs. While resource-intensive, this approach shows that rigorous safety validation for healthcare AI systems is both necessary and achievable, setting a benchmark for future deployments.
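The tiered review and the iteration-level safety rates described in the Methods and Results lend themselves to a simple pipeline. Below is a minimal Python sketch of how flagged calls might flow from a clinician flag through internal nursing review and physician adjudication, and how correct-advice and harm rates could be aggregated per iteration. The class names, severity labels, and escalation rule are illustrative assumptions for this sketch, not definitions taken from the published framework.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Severity(Enum):
    """Illustrative severity buckets; the paper describes a range from minor
    clinical inaccuracies to significant safety concerns without fixing labels."""
    NONE = "no_error"
    MINOR = "minor_clinical_inaccuracy"
    MINOR_HARM = "incorrect_advice_potential_minor_harm"
    SEVERE_HARM = "incorrect_advice_potential_severe_harm"


@dataclass
class CallReview:
    call_id: str
    flagged: bool                            # Tier 1: reviewing clinician flags a potential error
    nurse_severity: Optional[Severity]       # Tier 2: internal nursing review assigns severity
    physician_severity: Optional[Severity]   # Tier 3: physician adjudication when escalated


def adjudicate(review: CallReview) -> Severity:
    """Resolve a call's final severity through the tiered chain: unflagged calls
    carry no error; flagged calls take the nursing severity unless a physician
    adjudication overrides it (an assumed resolution rule)."""
    if not review.flagged:
        return Severity.NONE
    if review.physician_severity is not None:
        return review.physician_severity
    return review.nurse_severity or Severity.NONE


def safety_metrics(reviews: list[CallReview]) -> dict[str, float]:
    """Aggregate rates analogous to those reported per iteration
    (correct advice, potential minor harm, potential severe harm)."""
    if not reviews:
        raise ValueError("no reviews to aggregate")
    counts = Counter(adjudicate(r) for r in reviews)
    total = len(reviews)
    return {
        "correct_advice_rate": counts[Severity.NONE] / total,
        "minor_harm_rate": counts[Severity.MINOR_HARM] / total,
        "severe_harm_rate": counts[Severity.SEVERE_HARM] / total,
    }
```

As a usage sketch, running `safety_metrics` over the set of reviewed calls for each model iteration (pre-Polaris, Polaris 1.0, 2.0, 3.0) would yield the kind of per-iteration comparison reported above; the continuous-monitoring stage would then feed newly flagged error patterns back into system updates before the next evaluation round.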