Division of Pediatric Plastic Surgery | Vanderbilt | Mar 18, 2026 | arXiv:2603.17383

Robust Nasality Representation Learning for Cleft Palate-Related Velopharyngeal Dysfunction Screening in Real-World Settings

Weixin Liu, Bowen Qu, Amy Stone, Maria E. Powell, Shama Dufresne, Stephane A. Braun, Izabela A. Galdyn, Michael Golinko, Bradley Malin, Zhijun Yin, Matthew E. Pontell

AI Summary

This paper introduces a two-stage framework for robust velopharyngeal dysfunction (VPD) screening. In the first stage, supervised contrastive pre-training on an auxiliary corpus learns a nasality-focused speech representation. In the second stage, the encoder is frozen and lightweight classifiers score 0.5-second speech chunks, whose probabilities are aggregated into a recording-level decision. On 131 heterogeneous public Internet recordings, the method achieves the best out-of-domain performance among the compared methods (macro-F1 = 0.679, accuracy = 0.695), ahead of large pretrained speech representations and the strongest baseline, MFCC (macro-F1 = 0.612, accuracy = 0.641). The results suggest that nasality-focused representation learning reduces sensitivity to recording artifacts in real-world VPD screening.
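To make the first stage concrete, here is a minimal NumPy sketch of a supervised contrastive loss over a batch of embeddings. It assumes the commonly used SupCon formulation (temperature-scaled cosine similarity, with positives defined as same-label samples); the paper's exact loss, temperature, and batch construction are not stated on this page. The function name, the `temperature` value, and the label encoding (0 = oral context, 1 = nasal context) are hypothetical.

```python
import numpy as np


def supervised_contrastive_loss(embeddings, labels, temperature=0.1):
    """SupCon-style loss: pull same-label embeddings together, push others apart.

    Assumes at least two samples and at least one same-label pair in the batch.
    Labels are hypothetical: 0 = oral context, 1 = nasal context.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    labels = np.asarray(labels)

    # L2-normalize so that dot products are cosine similarities.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature

    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    positives = (labels[:, None] == labels[None, :]) & not_self

    # Log-softmax over every other sample in the batch (self excluded).
    sim = np.where(not_self, sim, -np.inf)
    sim_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - sim_max - np.log(
        np.exp(sim - sim_max).sum(axis=1, keepdims=True)
    )

    # Average log-probability of the positives, per anchor that has any.
    pos_counts = positives.sum(axis=1)
    valid = pos_counts > 0
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[valid] / pos_counts[valid]).mean())


# Toy batch: two tight clusters in 2-D.
toy = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
loss_matching = supervised_contrastive_loss(toy, [0, 0, 1, 1])
loss_mismatched = supervised_contrastive_loss(toy, [0, 1, 0, 1])
```

In this toy batch the loss is lower when the labels match the cluster structure than when they cut across it, which is the property the pre-training stage relies on to shape the representation around nasality.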

Key Contribution

Pre-training with nasal-context versus oral-context supervision lets a frozen encoder with lightweight classifiers outperform large pretrained speech models at VPD screening on noisy, real-world recordings.

Abstract

Velopharyngeal dysfunction (VPD) is characterized by inadequate velopharyngeal closure during speech and often causes hypernasality and reduced intelligibility. Although speech-based machine learning models can perform well under standardized clinical recording conditions, their performance often drops in real-world settings because of domain shift caused by differences in devices, channels, noise, and room acoustics. To improve robustness, we propose a two-stage framework for VPD screening. First, a nasality-focused speech representation is learned by supervised contrastive pre-training on an auxiliary corpus with phoneme alignments, using oral-context versus nasal-context supervision. Second, the encoder is frozen and used with lightweight classifiers on 0.5-second speech chunks, whose probabilities are aggregated to produce recording-level decisions with a fixed threshold. On an in-domain clinical cohort of 82 subjects, the proposed method achieved perfect recording-level screening performance (macro-F1 = 1.000, accuracy = 1.000). On a separate out-of-domain set of 131 heterogeneous public Internet recordings, large pretrained speech representations degraded substantially, while MFCC was the strongest baseline (macro-F1 = 0.612, accuracy = 0.641). The proposed method achieved the best out-of-domain performance (macro-F1 = 0.679, accuracy = 0.695), improving on the strongest baseline under the same evaluation protocol. These results suggest that learning a nasality-focused representation before clinical classification can reduce sensitivity to recording artifacts and improve robustness for deployable speech-based VPD screening.
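The second stage and the reported metric can be sketched as follows. This is an illustrative sketch only: the 0.5-second chunk length and the fixed threshold come from the abstract, but the non-overlapping chunking, the mean-pooling aggregation rule, the threshold value of 0.5, and all function names are assumptions, since the abstract does not specify them.

```python
import numpy as np


def split_into_chunks(waveform, sample_rate, chunk_seconds=0.5):
    """Split a 1-D waveform into non-overlapping chunks, dropping the remainder."""
    waveform = np.asarray(waveform)
    chunk_len = int(round(sample_rate * chunk_seconds))
    n_chunks = len(waveform) // chunk_len
    return waveform[: n_chunks * chunk_len].reshape(n_chunks, chunk_len)


def recording_decision(chunk_probs, threshold=0.5):
    """Aggregate chunk-level VPD probabilities into one recording-level label.

    Mean pooling and threshold=0.5 are assumptions, not the paper's stated rule.
    """
    score = float(np.mean(chunk_probs))
    return score, int(score >= threshold)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores for binary labels {0, 1}."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for cls in (0, 1):
        tp = np.sum((y_pred == cls) & (y_true == cls))
        fp = np.sum((y_pred == cls) & (y_true != cls))
        fn = np.sum((y_pred != cls) & (y_true == cls))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom > 0 else 0.0)
    return float(np.mean(scores))


# 1.3 s of audio at 16 kHz yields two full 0.5 s chunks.
chunks = split_into_chunks(np.zeros(20800), sample_rate=16000)

# Hypothetical chunk probabilities from a lightweight classifier.
score, label = recording_decision([0.9, 0.8, 0.2])
```

Macro-F1 weights both classes equally, which is why it can sit below accuracy, as it does in the out-of-domain results above, when one class is recognized less reliably than the other.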

Natural Language Processing | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 25

Year: 2026

Venue: N/A
