Chair of Computer Science 10 (System; Erlangen; Erlangen National High Performance; Friedrich-Alexander-Universität; RWTH
May 5, 2026
arXiv:2605.04039

Safety and accuracy follow different scaling laws in clinical large language models

Sebastian Wind, Tri-Thien Nguyen, Jeta Sopa, Mahshad Lotfinia, Sebastian Bickelhaup, Michael Uder, Harald Köstler, Gerhard Wellein, Sven Nebelung, Daniel Truhn, Andreas K. Maier, Soroosh Tayebi Arasteh

AI Summary

The authors introduce SaFE-Scale, a framework and benchmark (RadSaFE-200) to evaluate how scaling clinical LLMs impacts safety across various deployment conditions like retrieval strategy and context exposure. They found that while clean evidence significantly improves accuracy and reduces high-risk errors, standard RAG and agentic RAG do not consistently improve safety, and max-context prompting offers limited safety gains. Their analysis reveals that clinical LLM safety is not a direct consequence of scaling but depends heavily on evidence quality and deployment strategy.

Key Contribution

Scaling clinical LLMs doesn't guarantee safety: high-risk errors persist even with advanced RAG and max-context prompting, highlighting the critical role of evidence quality and deployment strategy.

Abstract

Clinical LLMs are often scaled by increasing model size, context length, retrieval complexity, or inference-time compute, with the implicit expectation that higher accuracy implies safer behavior. This assumption is incomplete in medicine, where a few confident, high-risk, or evidence-contradicting errors can matter more than average benchmark performance. We introduce SaFE-Scale, a framework for measuring how clinical LLM safety changes across model scale, evidence quality, retrieval strategy, context exposure, and inference-time compute. To instantiate this framework, we introduce RadSaFE-200, a Radiology Safety-Focused Evaluation benchmark of 200 multiple-choice questions with clinician-defined clean evidence, conflict evidence, and option-level labels for high-risk error, unsafe answer, and evidence contradiction. We evaluated 34 locally deployed LLMs across six deployment conditions: closed-book prompting (zero-shot), clean evidence, conflict evidence, standard RAG, agentic RAG, and max-context prompting. Clean evidence produced the strongest improvement, increasing mean accuracy from 73.5% to 94.1%, while reducing high-risk error from 12.0% to 2.6%, contradiction from 12.7% to 2.3%, and dangerous overconfidence from 8.0% to 1.6%. Standard RAG and agentic RAG did not reproduce this safety profile: agentic RAG improved accuracy over standard RAG and reduced contradiction, but high-risk error and dangerous overconfidence remained elevated. Max-context prompting increased latency without closing the safety gap, and additional inference-time compute produced only limited gains. Worst-case analysis showed that clinically consequential errors concentrated in a small subset of questions. Clinical LLM safety is therefore not a passive consequence of scaling, but a deployment property shaped by evidence quality, retrieval design, context construction, and collective failure behavior.
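The four rates reported in the abstract (accuracy, high-risk error, contradiction, dangerous overconfidence) can be read as per-response fractions computed from option-level labels. The following is a minimal sketch of that bookkeeping, not the authors' implementation. The record layout (`OptionLabels`, `Response`), the self-reported `confidence` field, the 0.8 threshold, and the definition of dangerous overconfidence as "a high-risk answer given at or above the threshold" are all assumptions made for illustration; the abstract does not specify them. The helper `high_risk_counts_by_question` shows one simple way to check whether high-risk errors concentrate in a few questions.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class OptionLabels:
    """Hypothetical option-level labels attached to one answer option."""
    correct: bool
    high_risk: bool
    contradicts_evidence: bool


@dataclass(frozen=True)
class Response:
    """One model answer to one question (hypothetical record layout)."""
    question_id: str
    chosen: OptionLabels
    confidence: float  # assumed self-reported confidence in [0, 1]


def safety_metrics(responses, confidence_threshold=0.8):
    """Return accuracy and safety rates as fractions of all responses.

    Assumption: 'dangerous overconfidence' means a high-risk answer given
    with confidence at or above `confidence_threshold`.
    """
    n = len(responses)
    if n == 0:
        raise ValueError("no responses to score")
    overconfident = sum(
        r.chosen.high_risk and r.confidence >= confidence_threshold
        for r in responses
    )
    return {
        "accuracy": sum(r.chosen.correct for r in responses) / n,
        "high_risk_error": sum(r.chosen.high_risk for r in responses) / n,
        "contradiction": sum(r.chosen.contradicts_evidence for r in responses) / n,
        "dangerous_overconfidence": overconfident / n,
    }


def high_risk_counts_by_question(responses):
    """Count high-risk answers per question, most affected question first."""
    counts = Counter(r.question_id for r in responses if r.chosen.high_risk)
    return counts.most_common()


# Toy data, invented purely to exercise the functions.
SAFE = OptionLabels(correct=True, high_risk=False, contradicts_evidence=False)
RISKY = OptionLabels(correct=False, high_risk=True, contradicts_evidence=True)
BENIGN = OptionLabels(correct=False, high_risk=False, contradicts_evidence=False)

responses = [
    Response("q1", SAFE, 0.9),
    Response("q1", SAFE, 0.7),
    Response("q2", RISKY, 0.95),
    Response("q2", RISKY, 0.5),
    Response("q3", BENIGN, 0.6),
]
metrics = safety_metrics(responses)
worst_questions = high_risk_counts_by_question(responses)
```

In this toy data, accuracy and the high-risk error rate are both 0.4, yet every high-risk answer falls on a single question. That is the kind of pattern a mean score hides and a per-question count exposes.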

Eval Frameworks & Benchmarks; Red-Teaming & Adversarial Robustness; Scaling Laws & Emergent Abilities

Citation Metrics

Citations: 0

Influential citations: 0

References: 46

Year: 2026

Venue: N/A
