The paper introduces CACTUS, a machine learning framework designed for robust and interpretable classification in clinical datasets with missing data. CACTUS integrates feature abstraction, interpretable classification, and feature stability analysis to identify features that remain informative even as data quality degrades. Applied to a haematuria cohort of 568 patients evaluated for bladder cancer, CACTUS achieved predictive performance competitive with or superior to random forests and gradient boosting, while keeping its top-ranked features markedly more stable as missingness increased.
Stop chasing peak performance on messy clinical data and start building trust: CACTUS delivers robust predictions with stable, interpretable features even when data goes missing.
Machine learning models are increasingly applied to biomedical data, yet their adoption in high-stakes domains remains limited by poor robustness, limited interpretability, and instability of learned features under realistic data perturbations, such as missingness. In particular, models that achieve high predictive performance may still fail to inspire trust if their key features fluctuate when data completeness changes, undermining reproducibility and downstream decision-making. Here, we present CACTUS (Comprehensive Abstraction and Classification Tool for Uncovering Structures), an explainable machine learning framework explicitly designed to address these challenges in small, heterogeneous, and incomplete clinical datasets. CACTUS integrates feature abstraction, interpretable classification, and systematic feature stability analysis to quantify how consistently informative features are preserved as data quality degrades. Using a real-world haematuria cohort comprising 568 patients evaluated for bladder cancer, we benchmark CACTUS against widely used machine learning approaches, including random forests and gradient boosting methods, under controlled levels of randomly introduced missing data. We demonstrate that CACTUS achieves competitive or superior predictive performance while maintaining markedly higher stability of top-ranked features as missingness increases, including in sex-stratified analyses. Our results show that feature stability provides information complementary to conventional performance metrics and is essential for assessing the trustworthiness of machine learning models applied to biomedical data. By explicitly quantifying robustness to missing data and prioritising interpretable, stable features, CACTUS offers a generalisable framework for trustworthy data-driven decision support.
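
To make the idea of feature stability under missingness concrete, the sketch below shows one possible way to measure it. It is not the CACTUS implementation, which the abstract does not specify. Every choice here is an assumption made for illustration: values are removed completely at random, gaps are filled with column means, features are ranked by absolute Pearson correlation with the label, and stability is the Jaccard overlap between the top-k set on complete data and the top-k set on degraded data. The function names (inject_mcar, top_k_features, jaccard, stability_curve) are hypothetical, and the data are synthetic rather than the haematuria cohort.

```python
import numpy as np


def inject_mcar(X, rate, rng):
    """Copy X and set roughly a fraction `rate` of entries to NaN, uniformly at random."""
    X_missing = X.astype(float).copy()
    mask = rng.random(X.shape) < rate
    X_missing[mask] = np.nan
    return X_missing


def top_k_features(X, y, k):
    """Return the indices of the k features with the largest |Pearson r| with y.

    Assumption for illustration: missing values are mean-imputed per column.
    """
    col_means = np.nanmean(X, axis=0)
    X_filled = np.where(np.isnan(X), col_means, X)
    Xc = X_filled - X_filled.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    denom[denom == 0] = np.inf  # constant columns get a score of 0
    scores = np.abs(Xc.T @ yc) / denom
    return set(np.argsort(scores)[::-1][:k].tolist())


def jaccard(a, b):
    """Jaccard overlap of two sets: 1.0 means identical, 0.0 means disjoint."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def stability_curve(X, y, rates, k=5, seed=0):
    """Map each missingness rate to the top-k overlap with the complete-data ranking."""
    rng = np.random.default_rng(seed)
    baseline = top_k_features(X, y, k)
    return {
        rate: jaccard(baseline, top_k_features(inject_mcar(X, rate, rng), y, k))
        for rate in rates
    }


if __name__ == "__main__":
    # Synthetic stand-in data: 200 samples, 20 features, label driven by features 0 and 1.
    rng = np.random.default_rng(42)
    X = rng.normal(size=(200, 20))
    y = (X[:, 0] + X[:, 1] + 0.5 * rng.normal(size=200) > 0).astype(float)
    for rate, overlap in stability_curve(X, y, rates=[0.0, 0.1, 0.3, 0.5]).items():
        print(f"missingness={rate:.1f}  top-5 Jaccard overlap={overlap:.2f}")
```

A curve that stays close to 1.0 as the missingness rate grows indicates that the same features keep being selected, which is the property the abstract argues should be reported alongside conventional performance metrics. Swapping in a different ranking method (for example, importances from a tree ensemble) only requires replacing top_k_features.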