DISCOVER Robotics · † Advising · Edinburgh · USTC · Jun 4, 2026 · arXiv:2606.06170

CoSTA: Cognitive-State-Conditioned TTS Data Augmentation Using ASR Transcripts for Alzheimer's Disease Detection

Yin-Long Liu, Yuanchao Li, Yiming Wang, Yue Li, Rui Feng, Jiaxin Chen, Shaobo Liu, Liu He, Yuang Chen, Jiahong Yuan, Zhen-Hua Ling

AI Summary

This paper introduces CoSTA, a data augmentation framework that uses Cognitive-State-Conditioned (CS-Cond) Text-to-Speech (TTS) models to improve speech-based Alzheimer's Disease (AD) detection when pathological speech data is scarce. The authors adapt CosyVoice2 and F5-TTS to synthesize speech with distinct AD and Healthy Control characteristics, and compare text sources for TTS augmentation using a pool of Manual Transcripts and 36 ASR transcripts. On the ADReSS dataset, ASR-driven augmentation frequently outperforms augmentation driven by manual transcripts, and CoSTA yields a 4.16% gain over the baseline, reaching an audio-only accuracy of 85.83% on the test set.

Key Contribution

CoSTA, a cognitive-state-conditioned TTS augmentation framework using ASR transcripts, yields a 4.16% gain over the baseline in audio-only Alzheimer's detection, reaching 85.83% accuracy on the ADReSS test set.

Abstract

Speech-based Alzheimer's Disease (AD) detection is constrained by scarce pathological speech data. To address this, we propose CoSTA, a Text-to-Speech (TTS)-based data augmentation framework. Specifically, we first develop two Cognitive-State-Conditioned (CS-Cond) TTS models by adapting CosyVoice2 and F5-TTS to synthesize speech with distinct AD and Healthy Control characteristics. Furthermore, by constructing a transcript pool comprising Manual Transcripts (MT) and 36 Automatic Speech Recognition (ASR) transcripts, we investigate the impact of text sources on TTS-based augmentation. We also perform augmentation-factor analysis and test-time augmentation. Experiments on the ADReSS dataset show that CS-Cond TTS significantly improves synthetic speech utility, and ASR-driven augmentation frequently outperforms MT-driven augmentation. Finally, CoSTA yields a 4.16% gain over the baseline, achieving an audio-only accuracy of 85.83% on the ADReSS test set and outperforming prior methods.
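The abstract names three ingredients without giving implementation detail: a transcript pool mixing manual and ASR transcripts, an augmentation factor, and test-time augmentation. The following is a minimal sketch of how those pieces could fit together, not the authors' implementation. Everything in it is an assumption made for illustration: the `synthesize` stub stands in for a CS-Cond TTS model, each synthetic utterance is assumed to inherit the label of its source recording, and test-time augmentation is assumed to mean averaging a classifier's AD probability over several views of one recording.

```python
import random
from statistics import mean


def synthesize(text, state, seed):
    """Hypothetical stand-in for a cognitive-state-conditioned TTS model.

    A real model would return a waveform; this stub returns a tagged string
    so the sketch runs without any speech toolkit installed.
    """
    return f"<audio state={state} seed={seed} text={text[:20]!r}>"


def augment(train_set, transcript_pool, factor, rng):
    """Create `factor` synthetic utterances per training recording.

    train_set: list of (recording_id, label) pairs, label in {"AD", "HC"}.
    transcript_pool: dict mapping recording_id to a list of candidate
        transcripts (manual and ASR versions of the same recording).
    Assumption: the synthetic utterance keeps the source recording's label,
    and that label is also used as the TTS conditioning state.
    """
    augmented = []
    for rec_id, label in train_set:
        for k in range(factor):
            text = rng.choice(transcript_pool[rec_id])
            augmented.append((synthesize(text, label, seed=k), label))
    return augmented


def tta_predict(score_fn, views, threshold=0.5):
    """Assumed form of test-time augmentation: average P(AD) over views."""
    p_ad = mean(score_fn(v) for v in views)
    return ("AD" if p_ad >= threshold else "HC"), p_ad


if __name__ == "__main__":
    rng = random.Random(0)
    train = [("rec1", "AD"), ("rec2", "HC")]
    pool = {
        "rec1": ["manual transcript one", "asr transcript one"],
        "rec2": ["manual transcript two", "asr transcript two"],
    }
    synthetic = augment(train, pool, factor=3, rng=rng)
    print(len(synthetic), "synthetic utterances")

    # Toy scores standing in for a classifier's P(AD) on three views.
    toy_scores = {"view_a": 0.2, "view_b": 0.9, "view_c": 0.7}
    label, p_ad = tta_predict(toy_scores.get, list(toy_scores))
    print(label, round(p_ad, 2))
```

In this sketch the augmentation factor is simply the number of synthetic utterances generated per original recording, so sweeping `factor` is one way an augmentation-factor analysis could be set up.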

Data Curation & Synthetic Data · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
