CNRS, GETALP Team, LAMSADE, MILES Team, NLP team, Université Grenoble Alpes, Université Paris Dauphine-PSL
Jun 9, 2026
arXiv:2606.10654

Speaker Group Encoding in Self-supervised Speech Recognition Models

Felix Herron, Solange Rossato, Alexandre Allauzen, Benoit Favre, François Portet

AI Summary

This study explores how self-supervised speech recognition models (S3Ms) encode information about speaker groups (SGs) across several training states: pretrained, finetuned for speaker identification (SID), finetuned for automatic speech recognition (ASR), and ASR-finetuned with a fairness-enhancing algorithm. The findings indicate that finetuning for SID amplifies the encoding of phonetically variant speaker group categories, whereas ASR finetuning discards this information while retaining semantically variant speaker group information. Fairness-enhancing ASR algorithms also change how much speaker group information is encoded, primarily for phonetically variant categories and less so for semantically variant ones.

Key Contribution

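Finetuning S3Ms can either amplify or discard speaker group information: SID finetuning amplifies phonetically variant information, ASR finetuning discards it while retaining semantically variant information, and fairness-enhancing ASR algorithms mainly change how the phonetically variant information is encoded.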

Abstract

We investigate what self-supervised speech recognition models (S3Ms) learn about speaker groups (SGs). We examine several states of S3Ms: pretrained, finetuned on speaker identification (SID), finetuned on automatic speech recognition (ASR), and ASR-finetuned using a fairness-enhancing algorithm. We find that S3Ms encode information about several speaker group categories (SGCs), including speaker gender, age, dialect, ethnicity, and native-speaker status. We find that finetuning for SID amplifies certain SGCs, namely those whose variance is more phonetic in nature, though it does not amplify other SGCs, namely those whose variance is more semantic in nature. On the other hand, finetuning for ASR discards phonetically variant speaker group information (SGI) but retains semantically variant SGI. We find that ASR algorithms designed for fairness improvement change the extent to which SGI is encoded in S3Ms; however, this is primarily true for phonetically variant SGCs, and less true for semantically variant SGCs. We discuss how SGI is encoded by each layer, and identify subdimensions of embeddings responsible for encoding different SGCs. Finally, we discuss how our findings could be beneficial in designing fairer ASR algorithms.
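
The abstract does not state how layer-wise encoding of SGI is measured. One common way to study such questions is linear probing: train a simple linear classifier on the embeddings of each layer and compare held-out accuracy across layers. The sketch below illustrates that general idea only, under loud assumptions: it is not the authors' method, the embeddings are synthetic Gaussian noise rather than S3M outputs, the speaker group label is binary, and every name in it (`make_synthetic_layers`, `fit_linear_probe`, `probe_all_layers`, `signal_layer`, `signal_dims`) is hypothetical.

```python
import numpy as np


def make_synthetic_layers(n_layers=4, n_utts=400, dim=16,
                          signal_layer=2, signal_dims=(3, 7), seed=0):
    """Build toy per-layer embeddings (assumption: NOT real S3M outputs).

    Only `signal_layer` carries group information, and only in `signal_dims`.
    """
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, 2, size=n_utts)          # binary speaker group
    layers = rng.normal(size=(n_layers, n_utts, dim))  # pure noise everywhere
    shift = (2 * labels - 1) * 2.0                     # -2 or +2 per utterance
    for d in signal_dims:
        layers[signal_layer, :, d] += shift            # inject group signal
    return layers, labels


def fit_linear_probe(X, y, l2=1e-2):
    """Ridge-regularised least-squares probe; last weight is the bias."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    targets = 2.0 * y - 1.0                            # map {0, 1} to {-1, +1}
    A = Xb.T @ Xb + l2 * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ targets)


def probe_accuracy(w, X, y):
    """Fraction of utterances whose group the probe predicts correctly."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    pred = (Xb @ w > 0).astype(int)
    return float((pred == y).mean())


def probe_all_layers(layers, labels, train_frac=0.75, top_k=2):
    """Return (held-out accuracy, top-k weighted dimensions) for each layer."""
    n_train = int(train_frac * labels.shape[0])
    results = []
    for X in layers:
        w = fit_linear_probe(X[:n_train], labels[:n_train])
        acc = probe_accuracy(w, X[n_train:], labels[n_train:])
        # Dimensions with the largest absolute probe weight (bias excluded).
        top_dims = np.argsort(-np.abs(w[:-1]))[:top_k]
        results.append((acc, sorted(int(d) for d in top_dims)))
    return results


layers, labels = make_synthetic_layers()
results = probe_all_layers(layers, labels)
for i, (acc, dims) in enumerate(results):
    print(f"layer {i}: held-out accuracy={acc:.2f}, top dims={dims}")
```

In this toy setup, a layer that encodes the group label yields a probe accuracy well above chance, and the largest probe weights point to the dimensions carrying that signal. With real models, the synthetic arrays would be replaced by pooled hidden states from each layer, and probe accuracy would need to be compared against suitable baselines before drawing conclusions.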

Constitutional AI & AI Ethics | Speech & Audio

Year: 2026

Venue: N/A
