Technische Hochschule Nürnberg Georg Simon Ohm | Apr 23, 2026 | arXiv:2604.21628

Time vs. Layer: Locating Predictive Cues for Dysarthric Speech Descriptors in Wav2vec 2.0

Natalie Engert, Dominik Wagner, K. Riedhammer, Tobias Bocklet

AI Summary

This paper investigates which layers and time steps of Wav2vec 2.0 (W2V2) contain the most predictive information for different dysarthric speech descriptors. The authors regress five descriptors (intelligibility, imprecise consonants, inappropriate silences, harsh voice, and monoloudness) from W2V2 representations using attentive statistics pooling with layer-wise and time-wise aggregation. The results indicate that layer-wise aggregation works best for intelligibility, while time-wise modeling is superior for imprecise consonants, harsh voice, and monoloudness. For inappropriate silences, neither approach shows a clear advantage.

Key Contribution

Where you look in Wav2vec 2.0's representations matters: intelligibility is best predicted by aggregating across layers, while imprecise consonants, harsh voice, and monoloudness are better predicted by aggregating across time steps.

Abstract

Wav2vec 2.0 (W2V2) has shown strong performance in pathological speech analysis by effectively capturing the characteristics of atypical speech. Despite its success, it remains unclear which components of its learned representations are most informative for specific downstream tasks. In this study, we address this question by investigating the regression of dysarthric speech descriptors using annotations from the Speech Accessibility Project dataset. We focus on five descriptors, each addressing a different aspect of speech or voice production: intelligibility, imprecise consonants, inappropriate silences, harsh voice and monoloudness. Speech representations are derived from a W2V2-based feature extractor, and we systematically compare layer-wise and time-wise aggregation strategies using attentive statistics pooling. Our results show that intelligibility is best captured through layer-wise representations, whereas imprecise consonants, harsh voice and monoloudness benefit from time-wise modeling. For inappropriate silences, no clear advantage could be observed for either approach.
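To make the two aggregation strategies concrete, here is a minimal NumPy sketch of attentive statistics pooling applied along either the time axis or the layer axis of a stack of hidden states. It is an illustration under stated assumptions, not the authors' implementation: the tensor layout `(layers, time, dim)`, the plain averaging over the non-attended axis, the toy sizes, and the parameters `w` and `v` (which would be learned in practice) are all hypothetical.

```python
import numpy as np

def attentive_stats_pooling(x, w, v):
    """Pool x of shape (N, D) along its first axis.

    w (D, H) and v (H,) are hypothetical attention parameters; in a real
    model they would be trained jointly with the regressor.
    Returns the attention-weighted mean and std, concatenated: shape (2 * D,).
    """
    scores = np.tanh(x @ w) @ v                      # one score per row, (N,)
    scores = scores - scores.max()                   # stabilize the softmax
    alpha = np.exp(scores) / np.exp(scores).sum()    # attention weights, sum to 1
    mean = alpha @ x                                 # weighted mean, (D,)
    var = alpha @ (x ** 2) - mean ** 2               # weighted variance, (D,)
    std = np.sqrt(np.clip(var, 1e-8, None))          # clip guards against tiny negatives
    return np.concatenate([mean, std])

def time_wise(hidden, w, v):
    # hidden: (L, T, D). Assumption: average over layers, attend over time steps.
    return attentive_stats_pooling(hidden.mean(axis=0), w, v)

def layer_wise(hidden, w, v):
    # Assumption: average over time steps, attend over layers.
    return attentive_stats_pooling(hidden.mean(axis=1), w, v)

# Toy stand-in for W2V2 hidden states; sizes are illustrative only.
rng = np.random.default_rng(0)
L, T, D, H = 13, 50, 16, 8
hidden = rng.standard_normal((L, T, D))
w = rng.standard_normal((D, H)) * 0.1
v = rng.standard_normal(H)

time_feat = time_wise(hidden, w, v)    # utterance vector from time-wise pooling
layer_feat = layer_wise(hidden, w, v)  # utterance vector from layer-wise pooling
```

Either vector could then be fed to a small regression head, one per descriptor. The only difference between the two variants in this sketch is which axis the attention weights range over, which is the contrast the abstract describes.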

Interpretability & Mechanistic Interp, Natural Language Processing, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 27

Year: 2026

Venue: IEEE International Conference on Acoustics, Speech, and Signal Processing
