This paper introduces SHAPCA, a pipeline combining PCA for dimensionality reduction with SHAP values for explainability, specifically designed for spectroscopic data. SHAPCA addresses the challenge of high dimensionality and collinearity in spectroscopy data, which often leads to unstable and inconsistent explanations of ML model predictions. By applying PCA and then projecting SHAP values back into the original input space, SHAPCA provides more consistent and interpretable explanations, linking model predictions to specific spectral bands.
Unstable explanations plague ML models on spectroscopy data, but SHAPCA offers a more consistent and interpretable approach: it combines PCA with SHAP values and reports the explanations in the original input space.
In recent years, machine learning models have been increasingly applied to spectroscopic datasets for chemical and biomedical analysis. For their successful adoption, particularly in clinical and safety-critical settings, professionals and researchers must be able to understand and trust the reasoning behind model predictions. However, the inherently high dimensionality and strong collinearity of spectroscopy data pose a fundamental challenge to model explainability. These properties not only complicate model training but also undermine the stability and consistency of explanations, leading to fluctuations in feature importance across repeated training runs. Feature extraction techniques have been used to reduce the input dimensionality; however, the resulting features obscure the connection between the prediction and the original signal. This study proposes SHAPCA, an explainable machine learning pipeline that combines Principal Component Analysis (for dimensionality reduction) and SHapley Additive exPlanations (for post hoc explanation) to provide explanations in the original input space, which a practitioner can interpret and link back to the biological components. The proposed framework enables analysis from both global and local perspectives, revealing the spectral bands that drive overall model behaviour as well as the instance-specific features that influence individual predictions. Numerical analysis demonstrated the interpretability of the results and greater consistency across different runs.
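The idea of explaining a model in PCA space and then mapping the attributions back to spectral channels can be illustrated with a minimal sketch. The abstract does not specify the authors' back-projection rule or model, so everything here is an assumption made for illustration: the synthetic spectra, the choice of a linear regressor on the PCA scores, and all variable names are hypothetical. The linear case is chosen because its interventional SHAP values have a closed form (coefficient times deviation from the background mean), so no SHAP library is needed and the back-projection through the loadings is exact. A nonlinear model would need a SHAP explainer on the scores and an explicit rule for apportioning each component's attribution to the input channels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical synthetic "spectra": 60 samples x 200 strongly collinear channels,
# generated from 3 smooth latent bands plus a small amount of noise.
n_samples, n_channels, n_latent = 60, 200, 3
grid = np.linspace(0.0, 1.0, n_channels)
centers = np.array([0.25, 0.5, 0.75])
bands = np.exp(-((grid[None, :] - centers[:, None]) ** 2) / 0.005)
conc = rng.normal(size=(n_samples, n_latent))
X = conc @ bands + 0.01 * rng.normal(size=(n_samples, n_channels))
y = 2.0 * conc[:, 0] - 1.0 * conc[:, 2] + 0.05 * rng.normal(size=n_samples)

# PCA via SVD of the mean-centred data.
mu = X.mean(axis=0)
Xc = X - mu
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
n_components = 3
W = Vt[:n_components].T   # loadings, shape (channels, components)
Z = Xc @ W                # scores, shape (samples, components), zero mean

# Linear regression on the PCA scores (assumed model, not the authors').
A = np.column_stack([Z, np.ones(n_samples)])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
beta, intercept = coef[:-1], coef[-1]
pred = Z @ beta + intercept

# Closed-form SHAP values in PCA space for a linear model:
# phi_z[i, k] = beta[k] * (Z[i, k] - mean(Z[:, k])), and the score means are 0.
phi_z = Z * beta

# Back-projection: Z[i, k] = sum_j Xc[i, j] * W[j, k], so each component's
# attribution splits over channels as beta[k] * W[j, k] * Xc[i, j].
phi_x = Xc * (W @ beta)   # shape (samples, channels)

# Local view: per-sample attributions still sum to prediction minus baseline.
assert np.allclose(phi_x.sum(axis=1), pred - intercept)

# Global view: mean absolute attribution per spectral channel.
global_importance = np.abs(phi_x).mean(axis=0)
top_channels = np.argsort(global_importance)[::-1][:5]
print("most influential channels:", top_channels)
```

In this linear setting the back-projected values coincide with the SHAP values of the composed model (PCA followed by regression) evaluated directly on the input channels, which is why additivity is preserved. Each row of `phi_x` is a local explanation over spectral channels, and `global_importance` aggregates the rows into a global one.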