This paper investigates bias in speech-based cognitive impairment (CI) and depression detection using the DementiaBank Pitt Corpus, comparing traditional acoustic features (MFCCs, eGeMAPS) with Wav2Vec 2.0 (W2V2) embeddings. While W2V2 embeddings achieve higher accuracy in CI detection (UAR up to 80.6%), they exhibit significant performance disparities across gender and age subgroups, with females and younger participants showing lower discriminative power and specificity. The study highlights the need for fairness-aware evaluation and subgroup-specific analysis to address representational biases in clinical speech applications.
Wav2Vec 2.0, despite boosting accuracy in speech-based cognitive impairment detection, introduces significant gender and age biases, misclassifying females and younger individuals at a higher rate.
Speech-based detection of cognitive impairment (CI) offers a promising non-invasive approach for early diagnosis, yet performance disparities across demographic and clinical subgroups remain underexplored, raising concerns about fairness and generalizability. This study presents a systematic bias analysis of acoustic-based CI and depression classification using the DementiaBank Pitt Corpus. We compare traditional acoustic features (MFCCs, eGeMAPS) with contextualized speech embeddings from Wav2Vec 2.0 (W2V2), and evaluate classification performance across gender, age, and depression-status subgroups. For CI detection, higher-layer W2V2 embeddings outperform baseline features (UAR up to 80.6%) but exhibit performance disparities: females and younger participants show lower discriminative power (AUC of 0.769 and 0.746, respectively) and substantial specificity disparities (\(\Delta_{\text{spec}}\) up to 18% and 15%, respectively), placing these groups at a higher risk of misclassification than their counterparts. These disparities reflect representational biases, defined as systematic differences in model performance across demographic or clinical subgroups. Depression detection within CI subjects yields lower overall performance, with mild improvements from low- and mid-level W2V2 layers. Cross-task generalization between CI and depression classification is limited, indicating that each task depends on distinct representations. These findings emphasize the need for fairness-aware model evaluation and subgroup-specific analysis in clinical speech applications, particularly given the demographic and clinical heterogeneity of real-world populations.
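The subgroup metrics named in the abstract can be illustrated with a minimal sketch. It is not the authors' evaluation code. It assumes binary labels (0 = control, 1 = CI), takes UAR as the unweighted mean of per-class recall, and takes the specificity gap as the difference between the highest and lowest subgroup specificity; the abstract does not state how \(\Delta_{\text{spec}}\) is computed, so that definition is an assumption. All function names and the toy data are hypothetical.

```python
from collections import defaultdict


def recall_per_class(y_true, y_pred):
    """Return {label: recall} for every label that occurs in y_true."""
    hits, totals = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    return {label: hits[label] / totals[label] for label in totals}


def uar(y_true, y_pred):
    """Unweighted average recall: mean of the per-class recalls."""
    recalls = recall_per_class(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


def specificity(y_true, y_pred, negative=0):
    """Recall of the negative class, i.e. TN / (TN + FP).

    Raises KeyError if the subgroup contains no negative examples.
    """
    return recall_per_class(y_true, y_pred)[negative]


def subgroup_report(y_true, y_pred, groups):
    """Compute UAR and specificity separately for each subgroup label."""
    by_group = defaultdict(lambda: ([], []))
    for t, p, g in zip(y_true, y_pred, groups):
        by_group[g][0].append(t)
        by_group[g][1].append(p)
    return {
        g: {"uar": uar(t, p), "specificity": specificity(t, p)}
        for g, (t, p) in by_group.items()
    }


def specificity_gap(report):
    """Assumed gap definition: max minus min subgroup specificity."""
    specs = [m["specificity"] for m in report.values()]
    return max(specs) - min(specs)


# Toy data only (not from the paper): two hypothetical subgroups, A and B.
y_true = [0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1]
groups = ["A"] * 8 + ["B"] * 8

report = subgroup_report(y_true, y_pred, groups)
for name, metrics in report.items():
    print(name, metrics)
print("specificity gap:", specificity_gap(report))
```

In the toy data, subgroup B has the higher sensitivity but the lower specificity, the kind of disparity a pooled score can obscure and a per-subgroup report makes visible.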