This study investigates the feasibility of using synthetic data to train open-source LLMs for converting free-text radiology reports into structured data, specifically ACR TI-RADS templates for thyroid nodules. Six open-source models (Starcoderbase-1B/3B, Mistral-7B, Llama-3-8B, Llama-2-13B, and Yi-34B) were fine-tuned on 3000 synthetically generated thyroid nodule dictations. Results showed that Yi-34B, fine-tuned on synthetic data, achieved performance comparable to GPT-4 5-shot, and several open-source models outperformed GPT models, suggesting a viable, privacy-preserving alternative to proprietary models.
Forget expensive, proprietary LLMs: open-source models fine-tuned on synthetic data can match or beat GPT-4 in radiology reporting.
The study assessed the feasibility of using synthetic data to fine-tune open-source LLMs for free-text to structured-data conversion in radiology, comparing their performance with GPT models. A training set of 3000 synthetic thyroid nodule dictations was generated to train six open-source models (Starcoderbase-1B, Starcoderbase-3B, Mistral-7B, Llama-3-8B, Llama-2-13B, and Yi-34B), with the ACR TI-RADS template as the target model output. Model performance was tested on 50 thyroid nodule dictations from the MIMIC-III patient dataset and compared against the 0-shot, 1-shot, and 5-shot performance of GPT-3.5 and GPT-4. GPT-4 5-shot and Yi-34B showed the highest performance, with no statistically significant difference between the two, and several open-source models outperformed GPT models with statistical significance. Overall, models trained on synthetic data achieved performance comparable to GPT models in structured text conversion in our study. Given their privacy-preserving advantages, open LLMs can serve as a viable alternative to proprietary GPT models.
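To make the fine-tuning setup concrete, here is a minimal sketch of how one synthetic dictation and its target ACR TI-RADS fields could be packed into a supervised training record (prompt/completion JSONL). The field names, prompt wording, and record layout are illustrative assumptions, not the study's actual data format; ACR TI-RADS does score composition, echogenicity, shape, margin, and echogenic foci.

```python
import json

def make_training_record(dictation: str, tirads: dict) -> str:
    """Pack one synthetic dictation and its structured TI-RADS target
    into a JSONL line for supervised fine-tuning.

    The prompt text and JSON layout here are hypothetical; the study's
    actual training format is not specified in the abstract.
    """
    record = {
        "prompt": (
            "Convert the following thyroid nodule dictation "
            "into an ACR TI-RADS template:\n" + dictation
        ),
        "completion": json.dumps(tirads, sort_keys=True),
    }
    return json.dumps(record)

# Example synthetic dictation with the five ACR TI-RADS categories.
example = make_training_record(
    "Right mid-pole nodule, 1.2 cm, solid, hypoechoic, "
    "wider-than-tall, smooth margins, no echogenic foci.",
    {
        "composition": "solid",
        "echogenicity": "hypoechoic",
        "shape": "wider-than-tall",
        "margin": "smooth",
        "echogenic_foci": "none",
    },
)
```

A corpus of such records (3000 in the study) would then be fed to a standard causal-LM fine-tuning loop for each open model.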