Apr 22, 2026 arXiv:2604.20331

Surrogate modeling for interpreting black-box LLMs in medical predictions

Changho Han, Songsoo Kim, Dong Won Kim, L. Celi, Jaewoong Kim, SungA Bae, Dukyong Yoon

Medical Big Data Research Center, Seoul National University Medical Research Center, Seoul National University College of Medicine, Seoul, Republic of Korea; Department of Biomedical Informatics, Yonsei University College of Medicine; Laboratory for Computational Physiology, Massachusetts Institute of Technology, Cambridge, MA, USA; Division of Pulmonary, Critical Care, and Sleep Medicine, Beth Israel Deaconess Medical Center, Boston; Department of Biostatistics, Harvard T.H. Chan School of Public Health; Department of Cardiology, Yongin Severance Hospital, Yongin; Center for Digital Health, Yonsei University Health System; IN Healthcare, Severance Hospital

AI Summary

This paper introduces a surrogate modeling framework that interprets the latent knowledge encoded in large language models (LLMs) by approximating their input-output relationships through extensive prompting. By applying this framework to medical predictions, the authors quantitatively reveal how LLMs perceive input variables and uncover instances where LLMs propagate inaccuracies and biases, including scientifically refuted racial assumptions. The findings highlight the potential risks associated with deploying LLMs in sensitive domains, emphasizing the need for interpretability to ensure safe applications in healthcare settings.

Key Contribution

A surrogate modeling framework shows that LLMs can encode dangerous biases and inaccuracies, underscoring the need for interpretability in medical applications.

Abstract

Large language models (LLMs), trained on vast datasets, encode extensive real-world knowledge within their parameters, yet their black-box nature obscures the mechanisms and extent of this encoding. Surrogate modeling, which uses simplified models to approximate complex systems, can offer a path toward better interpretability of black-box models. We propose a surrogate modeling framework that quantitatively explains LLM-encoded knowledge. For a specific hypothesis derived from domain knowledge, this framework approximates the latent LLM knowledge space using observable elements (input-output pairs) through extensive prompting across a comprehensive range of simulated scenarios. Through proof-of-concept experiments in medical predictions, we demonstrate our framework's effectiveness in revealing the extent to which LLMs "perceive" each input variable in relation to the output. Particularly, given concerns that LLMs may perpetuate inaccuracies and societal biases embedded in their training data, our experiments using this framework quantitatively revealed both associations that contradict established medical knowledge and the persistence of scientifically refuted racial assumptions within LLM-encoded knowledge. By disclosing these issues, our framework can act as a red-flag indicator to support the safe and reliable application of these models.
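The general idea described in the abstract (prompt the model across a grid of simulated scenarios, then fit a simple model to the resulting input-output pairs) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration only: the abstract does not state which surrogate model class the authors use, so a linear model fitted by ordinary least squares is swapped in; `mock_llm_risk` is a hypothetical stand-in for a prompted LLM; and the variables `age`, `smoker`, and `race_b` are invented examples, not the paper's.

```python
import itertools


def mock_llm_risk(age, smoker, race_b):
    """Hypothetical stand-in for a prompted LLM returning a risk score.

    A real pipeline would build a prompt from the scenario and parse the
    model's answer. The race_b term mimics an encoded assumption that the
    surrogate is meant to expose.
    """
    return 0.05 + 0.004 * (age - 40) + 0.10 * smoker + 0.03 * race_b


def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear_surrogate(rows, outputs):
    """Ordinary least squares via the normal equations (X^T X) beta = X^T y."""
    x = [[1.0] + list(r) for r in rows]  # prepend an intercept column
    k = len(x[0])
    xtx = [[sum(xi[i] * xi[j] for xi in x) for j in range(k)] for i in range(k)]
    xty = [sum(xi[i] * y for xi, y in zip(x, outputs)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Simulated scenarios: every combination of the assumed input variables.
scenarios = list(itertools.product(range(40, 81, 10), [0, 1], [0, 1]))

# Observable input-output pairs collected from the black box.
outputs = [mock_llm_risk(*s) for s in scenarios]

# The surrogate's coefficients estimate how strongly each input drives the output.
names = ["intercept", "age", "smoker", "race_b"]
surrogate = dict(zip(names, fit_linear_surrogate(scenarios, outputs)))

for name, value in surrogate.items():
    print(f"{name}: {value:+.4f}")
```

In this toy setting, a nonzero coefficient on `race_b` is the kind of signal that could serve as a red flag: the black box changes its output with a variable that, under the stated hypothesis, should carry no weight. With a real LLM the outputs would be noisy and possibly nonlinear, so the choice of surrogate and the number of repeated prompts per scenario would need separate justification.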

Interpretability & Mechanistic Interp

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
