Feb 24, 2026 · arXiv:2602.20442

Imputation of Unknown Missingness in Sparse Electronic Health Records

Jun Han, Josue Nassar, Sanjit Singh Batra, Aldo Cordova-Palomera, Vijay Nori, Robert E. Tillman

AI Summary

The paper introduces a transformer-based denoising neural network to address unknown missingness in binary electronic health records (EHRs), where it is difficult to tell whether a missing value means the condition is truly absent or was simply never recorded. The method adaptively thresholds the network's output to recover values that it predicts are missing, effectively denoising the EHR data. Experiments on a real EHR dataset show improved accuracy in medical code denoising and statistically significant gains in downstream hospital readmission prediction compared to existing imputation techniques.

Key Contribution

A transformer-based method can effectively impute "unknown unknowns" in EHR data, leading to improved performance in downstream tasks like predicting hospital readmissions.

Abstract

Machine learning holds great promise for advancing the field of medicine, with electronic health records (EHRs) serving as a primary data source. However, EHRs are often sparse and contain missing data due to various challenges and limitations in data collection and sharing between healthcare providers. Existing techniques for imputing missing values predominantly focus on known unknowns, such as missing or unavailable values of lab test results; most do not explicitly address situations where it is difficult to distinguish what is missing. For instance, a missing diagnosis code in an EHR could signify either that the patient has not been diagnosed with the condition or that a diagnosis was made but not shared by a provider. Such situations fall into the paradigm of unknown unknowns. To address this challenge, we develop a general-purpose algorithm for denoising data to recover unknown missing values in binary EHRs. We design a transformer-based denoising neural network whose output is thresholded adaptively to recover values in cases where we predict data are missing. Our results demonstrate improved accuracy in denoising medical codes within a real EHR dataset compared to existing imputation approaches, and the denoised data lead to increased performance on downstream tasks. In particular, when applied to a real-world task, predicting hospital readmission from EHRs, our method achieves statistically significant improvement over all existing baselines.
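The abstract does not say how the threshold is adapted, so the following is only a minimal sketch of one plausible reading, not the authors' method. It assumes a denoiser has already produced a score in [0, 1] for each medical code (the scores here are hard-coded stand-ins for the transformer output), picks a threshold by maximizing F1 on held-out entries whose true values are known, and then flips an observed 0 to 1 only when its score clears that threshold. The function names `f1_score`, `pick_threshold`, and `recover` are hypothetical.

```python
from typing import List, Sequence


def f1_score(truth: Sequence[int], pred: Sequence[int]) -> float:
    """F1 for binary labels; returns 0.0 when there are no positives at all."""
    tp = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(truth, pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(truth, pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def pick_threshold(
    scores: Sequence[float], truth: Sequence[int], candidates: Sequence[float]
) -> float:
    """Assumed selection rule: the candidate with the best held-out F1.

    Ties are broken in favor of the earliest candidate.
    """
    best_t, best_f1 = candidates[0], -1.0
    for t in candidates:
        pred = [1 if s >= t else 0 for s in scores]
        f1 = f1_score(truth, pred)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t


def recover(
    observed: Sequence[int], scores: Sequence[float], threshold: float
) -> List[int]:
    """Keep every observed 1; flip a 0 to 1 only if its score clears the threshold."""
    return [1 if o == 1 or s >= threshold else 0 for o, s in zip(observed, scores)]


# Held-out entries with known truth (illustrative numbers, not from the paper).
val_scores = [0.9, 0.8, 0.35, 0.3, 0.1, 0.05]
val_truth = [1, 1, 1, 0, 0, 0]
threshold = pick_threshold(val_scores, val_truth, candidates=[0.2, 0.35, 0.5, 0.8])

# One patient's observed codes and the stand-in denoiser scores for them.
observed = [1, 0, 0, 0]
scores = [0.2, 0.7, 0.1, 0.4]
denoised = recover(observed, scores, threshold)
```

Observed 1s are never overwritten in this sketch because the setting described in the abstract concerns codes that are absent from the record but may in fact apply; whether the paper also corrects spurious 1s is not stated here.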

Data Curation & Synthetic Data · Natural Language Processing · Scientific Discovery & Drug Design

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
