University of the Basque Country (EHU) | Mar 30, 2026 | arXiv:2603.28167

Automating Early Disease Prediction Via Structured and Unstructured Clinical Data

Ane G Domingo-Aldama, Marcos Merino Prado, Alain García Olea, Josu Goikoetxea, Koldo Gojenola, Aitziber Atutxa

AI Summary

This paper introduces an automated pipeline for early disease prediction that leverages NLP to extract information from unstructured clinical discharge reports. The pipeline automates cohort selection, dataset generation, and outcome labeling, addressing the limitations of relying solely on structured EHR data. Results show that models trained on data enriched with discharge report information outperform models trained only on structured EHR data in predicting atrial fibrillation progression.

Key Contribution

Enriching structured EHR data with information extracted by NLP from unstructured discharge reports improves early disease prediction over structured EHR data alone, as evaluated on atrial fibrillation progression.

Abstract

This study presents a fully automated methodology for early prediction studies in clinical settings, leveraging information extracted from unstructured discharge reports. The proposed pipeline uses discharge reports to support the three main steps of early prediction: cohort selection, dataset generation, and outcome labeling. By processing discharge reports with natural language processing techniques, we can efficiently identify relevant patient cohorts, enrich structured datasets with additional clinical variables, and generate high-quality labels without manual intervention. This approach addresses the frequent issue of missing or incomplete data in codified electronic health records (EHR), capturing clinically relevant information that is often underrepresented. We evaluate the methodology in the context of predicting atrial fibrillation (AF) progression, showing that predictive models trained on datasets enriched with discharge report information achieve higher accuracy and correlation with true outcomes than models trained solely on structured EHR data, while also surpassing traditional clinical scores. These results demonstrate that automating the integration of unstructured clinical text can streamline early prediction studies, improve data quality, and enhance the reliability of predictive models for clinical decision-making.
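The three steps named in the abstract (cohort selection, dataset generation, outcome labeling) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Report` record, the regular expressions standing in for the NLP component, the extra variables, and the toy data are all hypothetical, and a real system would need proper clinical NLP (negation handling, abbreviations, non-English text).

```python
import re
from dataclasses import dataclass
from datetime import date


@dataclass
class Report:  # hypothetical record for one discharge report
    patient_id: str
    written: date
    text: str


# Assumption: keyword patterns stand in for the NLP extraction step.
AF = re.compile(r"\batrial fibrillation\b", re.I)
PROGRESSED = re.compile(r"\b(persistent|permanent) atrial fibrillation\b", re.I)
EXTRA_VARS = {
    "hypertension": re.compile(r"\bhypertension\b", re.I),
    "diabetes": re.compile(r"\bdiabetes\b", re.I),
}


def select_cohort(reports):
    """Step 1: map each AF patient to the date of their earliest AF mention."""
    index = {}
    for r in sorted(reports, key=lambda r: r.written):
        if AF.search(r.text) and r.patient_id not in index:
            index[r.patient_id] = r.written
    return index


def build_dataset(structured, reports, index):
    """Step 2: enrich structured rows with variables found in text up to the index date."""
    rows = {}
    for pid, idx_date in index.items():
        row = dict(structured.get(pid, {}))
        history = [r for r in reports if r.patient_id == pid and r.written <= idx_date]
        for name, pattern in EXTRA_VARS.items():
            row[name] = any(pattern.search(r.text) for r in history)
        rows[pid] = row
    return rows


def label_outcomes(reports, index):
    """Step 3: label 1 if any report after the index date mentions progression."""
    labels = {}
    for pid, idx_date in index.items():
        later = [r for r in reports if r.patient_id == pid and r.written > idx_date]
        labels[pid] = int(any(PROGRESSED.search(r.text) for r in later))
    return labels


# Toy data, invented for illustration only.
reports = [
    Report("p1", date(2020, 1, 5), "Paroxysmal atrial fibrillation. History of hypertension."),
    Report("p1", date(2021, 3, 2), "Now permanent atrial fibrillation."),
    Report("p2", date(2020, 6, 1), "Atrial fibrillation detected on ECG."),
    Report("p3", date(2020, 7, 1), "Admitted for pneumonia."),
]
structured = {"p1": {"age": 71}, "p2": {"age": 64}, "p3": {"age": 58}}

index = select_cohort(reports)
dataset = build_dataset(structured, reports, index)
labels = label_outcomes(reports, index)
```

Restricting the extracted variables to reports written on or before the index date, and the labels to reports written after it, keeps outcome information out of the features, which is the usual precaution in an early prediction setup.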

Data Curation & Synthetic Data | Natural Language Processing | Scientific Discovery & Drug Design

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
