This paper introduces a two-stage framework for robust velopharyngeal dysfunction (VPD) screening. In the first stage, supervised contrastive pre-training on an auxiliary corpus learns a nasality-focused speech representation. In the second, the encoder is frozen and lightweight classifiers operate on short speech chunks. On heterogeneous Internet recordings, the method achieves the best out-of-domain performance among the compared systems (macro-F1 = 0.679), outperforming large pretrained speech representations and the MFCC baseline (macro-F1 = 0.612). The results suggest that nasality-focused representation learning improves robustness to recording artifacts in real-world VPD screening.
Pre-training on nasal versus oral context lets a frozen encoder with lightweight classifiers outperform large pretrained speech models at VPD screening on heterogeneous, real-world Internet recordings.
Velopharyngeal dysfunction (VPD) is characterized by inadequate velopharyngeal closure during speech and often causes hypernasality and reduced intelligibility. Although speech-based machine learning models can perform well under standardized clinical recording conditions, their performance often drops in real-world settings because of domain shift caused by differences in devices, channels, noise, and room acoustics. To improve robustness, we propose a two-stage framework for VPD screening. First, a nasality-focused speech representation is learned by supervised contrastive pre-training on an auxiliary corpus with phoneme alignments, using oral-context versus nasal-context supervision. Second, the encoder is frozen and used with lightweight classifiers on 0.5-second speech chunks, whose probabilities are aggregated to produce recording-level decisions with a fixed threshold. On an in-domain clinical cohort of 82 subjects, the proposed method achieved perfect recording-level screening performance (macro-F1 = 1.000, accuracy = 1.000). On a separate out-of-domain set of 131 heterogeneous public Internet recordings, large pretrained speech representations degraded substantially, while MFCC was the strongest baseline (macro-F1 = 0.612, accuracy = 0.641). The proposed method achieved the best out-of-domain performance (macro-F1 = 0.679, accuracy = 0.695), improving on the strongest baseline under the same evaluation protocol. These results suggest that learning a nasality-focused representation before clinical classification can reduce sensitivity to recording artifacts and improve robustness for deployable speech-based VPD screening.
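The second stage described above (0.5-second chunks, per-chunk probabilities, one thresholded decision per recording) can be illustrated with a minimal sketch. The abstract does not state the aggregation rule, the threshold value, or the sample rate, so the mean aggregation, the 0.5 threshold, and the 16 kHz rate below are assumptions, and the function names `split_into_chunks` and `recording_decision` are hypothetical. The per-chunk probabilities would come from the frozen encoder plus a lightweight classifier, which is not modeled here.

```python
from statistics import fmean


def split_into_chunks(samples, sample_rate=16000, chunk_seconds=0.5):
    """Split a sequence of audio samples into non-overlapping fixed-length chunks.

    The 0.5 s chunk length comes from the abstract; the 16 kHz sample rate
    is an assumption. A trailing partial chunk is dropped.
    """
    chunk_len = int(round(sample_rate * chunk_seconds))
    if chunk_len <= 0:
        raise ValueError("chunk length must be at least one sample")
    n_chunks = len(samples) // chunk_len
    return [samples[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]


def recording_decision(chunk_probs, threshold=0.5):
    """Aggregate per-chunk VPD probabilities into one recording-level decision.

    Mean aggregation and the 0.5 threshold are assumptions; the abstract only
    says that probabilities are aggregated and compared to a fixed threshold.
    """
    if not chunk_probs:
        raise ValueError("need at least one chunk probability")
    score = fmean(chunk_probs)
    return score, score >= threshold


# Usage with made-up chunk probabilities standing in for classifier output.
chunks = split_into_chunks([0.0] * 20000)  # 1.25 s of silence at 16 kHz
score, is_positive = recording_decision([0.9, 0.8, 0.2])
```

Averaging over chunks means a few chunks corrupted by noise or channel artifacts are less likely to flip the recording-level result. Other aggregation rules, such as the median or majority voting, would fit the same interface.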