This paper explores continued pretraining (CPT) of wav2vec2-bert-2.0 for low-resource Swahili ASR by leveraging unlabeled audio and a small amount of labeled data via pseudo-labeling. The CPT approach is followed by supervised fine-tuning, significantly improving performance. The method achieves a 3.24% WER on Common Voice Swahili using only 20,000 labeled samples, outperforming previous state-of-the-art systems by a large margin.
You can slash ASR error rates in low-resource languages by over 60% with a simple continued pretraining recipe.
We investigate continued pretraining (CPT) for adapting wav2vec2-bert-2.0 to Swahili automatic speech recognition (ASR). Our approach combines unlabeled audio with limited labeled data through pseudo-labeled CPT followed by supervised fine-tuning. With 20,000 labeled samples, we achieve 3.24% WER on Common Voice Swahili, an 82% relative improvement over the baseline. This result surpasses the best previously reported academic system (8.3% WER from XLS-R) by a 61% relative margin. We provide concrete data requirements and a replicable methodology applicable to other low-resource languages.
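As a rough illustration of the pseudo-labeling stage described above, the sketch below uses the HuggingFace Transformers Wav2Vec2BertForCTC API to transcribe unlabeled 16 kHz audio with a seed model fine-tuned on the small labeled set. The checkpoint path is hypothetical, and the paper's exact CPT objective, decoding setup, and hyperparameters are not reproduced here.

```python
# Minimal pseudo-labeling sketch (assumptions: HuggingFace Transformers >= 4.37,
# a seed w2v-bert-2.0 CTC model already fine-tuned on the ~20k labeled samples;
# the checkpoint path below is hypothetical).
import torch
from transformers import AutoProcessor, Wav2Vec2BertForCTC

SEED_CKPT = "path/to/seed-w2v-bert-swahili-ctc"  # hypothetical checkpoint

processor = AutoProcessor.from_pretrained(SEED_CKPT)
model = Wav2Vec2BertForCTC.from_pretrained(SEED_CKPT).eval()

@torch.no_grad()
def pseudo_label(waveform_16khz: torch.Tensor) -> str:
    """Greedy CTC decode of one 16 kHz mono waveform into a pseudo-transcript."""
    inputs = processor(audio=waveform_16khz.numpy(), sampling_rate=16_000,
                       return_tensors="pt")
    logits = model(**inputs).logits          # (batch, time, vocab)
    ids = torch.argmax(logits, dim=-1)       # greedy token ids per frame
    return processor.batch_decode(ids)[0]    # collapse repeats / blanks to text
```

The resulting pseudo-transcripts would then serve as targets for the continued pretraining pass over the unlabeled Swahili audio, before the final supervised fine-tuning on the labeled set.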