This paper explores supervised contrastive learning (SupCon) as an auxiliary objective to improve the accent robustness of CTC-finetuned ASR systems built on self-supervised pretraining. SupCon is applied as an utterance-level contrastive loss that regularizes encoder representations without requiring architectural changes or explicit accent labels. Experiments on L2-ARCTIC show consistent WER reductions on unseen accents, with relative reductions of up to 25-29%, indicating improved accent invariance.
Cut ASR word error rate on unseen accents by up to 25-29% relative with a simple auxiliary contrastive loss.
ASR systems based on self-supervised acoustic pretraining and CTC fine-tuning achieve strong performance on native speech but remain sensitive to accent variability. We investigate supervised contrastive learning (SupCon) as a lightweight, accent-invariant auxiliary objective for CTC fine-tuning. An utterance-level contrastive loss regularizes encoder representations without architectural modification or explicit accent supervision. Experiments on the L2-ARCTIC benchmark show consistent WER reductions across multiple pretrained encoders, with up to 25-29% relative reduction under unseen-accent evaluation. Analysis using within-transcript cosine dispersion indicates that SupCon promotes more compact and stable representation geometry under accent variability. Overall, SupCon provides an effective and model-agnostic regularization strategy for improving accent robustness.
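To make the two quantities in the abstract concrete, here is a minimal sketch of an utterance-level SupCon loss and a within-transcript cosine dispersion measure, written in plain Python for readability. Several points are assumptions and not details confirmed by the abstract: that utterances sharing a transcript are treated as positives (suggested by the absence of accent labels and by the "within-transcript" analysis), that each utterance is represented by a single pooled encoder vector, that dispersion is the mean pairwise cosine distance among same-transcript utterances, and the function names and the temperature value. The loss follows the standard supervised contrastive formulation, averaged over anchors that have at least one positive.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def supcon_loss(embeddings, labels, temperature=0.1):
    """Supervised contrastive loss over one batch of utterance embeddings.

    embeddings: list of pooled utterance vectors (assumed: one per utterance).
    labels: one label per utterance; here assumed to be a transcript ID.
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor with no positive contributes nothing
        # Scaled similarities to every other item in the batch.
        logits = {j: dot(z[i], z[j]) / temperature for j in range(n) if j != i}
        # Numerically stable log-sum-exp for the denominator.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def within_label_cosine_dispersion(embeddings, labels):
    """Mean cosine distance between pairs that share a label (assumed definition)."""
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    dists = [
        1.0 - dot(z[i], z[j])
        for i in range(n)
        for j in range(i + 1, n)
        if labels[i] == labels[j]
    ]
    return sum(dists) / len(dists) if dists else 0.0


if __name__ == "__main__":
    # Toy batch: two hypothetical transcripts, two utterances each.
    batch = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
    transcript_ids = ["t1", "t1", "t2", "t2"]
    print("SupCon loss:", supcon_loss(batch, transcript_ids))
    print("Dispersion:", within_label_cosine_dispersion(batch, transcript_ids))
```

In training, a loss like this would be added to the CTC loss with a weighting coefficient (a hypothetical hyperparameter here), so the encoder is pulled toward similar representations for the same transcript regardless of accent while still being optimized for transcription.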