Mar 5, 2026 | arXiv:2603.04809

WhisperAlign: Word-Boundary-Aware ASR and WhisperX-Anchored Pyannote Diarization for Long-Form Bengali Speech

Aurchi Chowdhury, Rubaiyat-E-Zaman, Sk. Ashrafuzzaman Nafees

AI Summary

This paper introduces WhisperAlign, a system for Bengali long-form speech recognition and speaker diarization. For ASR, it chunks audio using word timestamps from whisper-timestamped and feeds the resulting segments to a fine-tuned acoustic model. For diarization, it combines pyannote.audio with WhisperX and fine-tunes the Pyannote segmentation model on the competition's Bengali dataset. The authors report that timestamp-based chunking and targeted segmentation fine-tuning reduce Word Error Rate (WER) and Diarization Error Rate (DER) in this low-resource setting.

Key Contribution

Domain-specific fine-tuning of the Pyannote segmentation model on the competition dataset improves diarization of low-resource Bengali speech, including passages with overlapping speakers.

Abstract

This paper presents our solution for DL Sprint 4.0, addressing the dual challenges of Bengali Long-Form Speech Recognition (Task 1) and Speaker Diarization (Task 2). Processing long-form, multi-speaker Bengali audio introduces significant hurdles in voice activity detection, overlapping speech, and context preservation. To solve the long-form transcription challenge, we implemented a robust audio chunking strategy using whisper-timestamped, allowing us to feed precise, context-aware segments into our fine-tuned acoustic model for high-accuracy transcription. For the diarization task, we developed an integrated pipeline leveraging pyannote.audio and WhisperX. A key contribution of our approach is the domain-specific fine-tuning of the Pyannote segmentation model on the competition dataset. This adaptation allowed the model to better capture the nuances of Bengali conversational dynamics and accurately resolve complex, overlapping speaker boundaries. Our methodology demonstrates that applying intelligent timestamped chunking to ASR and targeted segmentation fine-tuning to diarization significantly reduces Word Error Rate (WER) and Diarization Error Rate (DER) in low-resource settings.
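As an illustration of the timestamped chunking idea, the sketch below groups word-level timestamps into chunks that are cut only between words. It is not the authors' implementation: the abstract does not give the chunking rule or the chunk length, so the `Word` type, the `chunk_at_word_boundaries` helper, and the 30-second limit (chosen to match Whisper's 30-second input window) are all assumptions made for this example. The word list stands in for the word-level start and end times that a tool such as whisper-timestamped produces.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    """Hypothetical container for one word and its timestamps in seconds."""
    text: str
    start: float
    end: float


def chunk_at_word_boundaries(words: List[Word], max_len: float = 30.0) -> List[List[Word]]:
    """Group consecutive words into chunks spanning at most max_len seconds.

    Cuts are placed only between words, so no word is split across chunks.
    A single word longer than max_len still becomes its own (oversized) chunk.
    The 30 s default is an assumption, not a value taken from the paper.
    """
    chunks: List[List[Word]] = []
    current: List[Word] = []
    for word in words:
        # Close the current chunk if adding this word would exceed the limit.
        if current and word.end - current[0].start > max_len:
            chunks.append(current)
            current = []
        current.append(word)
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    demo = [
        Word("a", 0.0, 10.0),
        Word("b", 10.0, 20.0),
        Word("c", 20.0, 31.0),
        Word("d", 31.0, 40.0),
    ]
    for chunk in chunk_at_word_boundaries(demo):
        print(chunk[0].start, chunk[-1].end, [w.text for w in chunk])
```

Each chunk's first `start` and last `end` give the audio span to slice and pass to the recognizer. Cutting at word boundaries rather than at fixed offsets avoids truncating a word at a chunk edge, which is one plausible reading of "word-boundary-aware" in the title.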

Natural Language Processing, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 3

Year: 2026

Venue: N/A
