This paper explores long-form Bengali speech transcription and speaker diarization, two tasks for which Bengali is a low-resource language. The authors fine-tuned Whisper Medium for transcription and integrated pyannote/speaker-diarization-community-1 with a custom segmentation model for diarization. Through hyperparameter tuning and strategic data utilization, they achieved a DER of 0.27 and a WER of 0.38 on the private leaderboard of the DL Sprint 4.0 competition.
Fine-tuning Whisper and Pyannote models, combined with strategic data handling, significantly narrows the gap in speech technology performance for low-resource languages like Bengali.
Bengali remains a low-resource language in speech technology, especially for complex tasks like long-form transcription and speaker diarization. This paper presents a multistage approach developed for the "DL Sprint 4.0 - Bengali Long-Form Speech Recognition" and "DL Sprint 4.0 - Bengali Speaker Diarization" competitions on Kaggle, addressing the challenge of "who spoke when/what" in hour-long recordings. We implemented Whisper Medium fine-tuned on Bengali data (bengaliAI/tugstugi bengaliai-asr whisper-medium) for transcription and integrated pyannote/speaker-diarization-community-1 with our custom-trained segmentation model to handle diverse and noisy acoustic environments. Using a two-pass method with hyperparameter tuning, we achieved a DER of 0.27 on the private leaderboard and 0.19 on the public leaderboard. For transcription, chunking, background noise cleaning, and algorithmic post-processing yielded a WER of 0.38 on the private leaderboard. These results show that targeted tuning and strategic data utilization can significantly improve AI inclusivity for South Asian languages. All relevant code is available at: https://github.com/Short-Potatoes/Bengali-long-form-transcription-and-diarization.git
Index Terms: Bengali speech recognition, speaker diarization, Whisper, ASR, low-resource languages, pyannote, voice activity detection
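For readers unfamiliar with the two reported metrics, the sketch below shows how WER and DER are commonly defined. It is not the competition's scoring code: the function names, the whitespace tokenization, and the assumption that DER is computed from already-measured error durations are all illustrative choices, and the leaderboard may apply different text normalization or scoring rules.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count.

    Assumption: words are split on whitespace with no further normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def diarization_error_rate(
    false_alarm: float, missed: float, confusion: float, total_speech: float
) -> float:
    """DER = (false alarm + missed speech + speaker confusion) / total reference speech.

    All arguments are durations in seconds (hypothetical inputs; measuring
    them requires aligning reference and hypothesis speaker segments).
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (false_alarm + missed + confusion) / total_speech


if __name__ == "__main__":
    print(word_error_rate("a b c d", "a x c"))
    print(diarization_error_rate(30.0, 60.0, 45.0, 500.0))
```

Both metrics are error rates, so lower is better, and both can exceed 1.0 when the system output contains many insertions or false alarms.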