Feb 26, 2026 | arXiv:2602.23070

Make It Hard to Hear, Easy to Learn: Long-Form Bengali ASR and Speaker Diarization via Extreme Augmentation and Perfect Alignment

Sanjid Hasan, Risalat Labib, A H M Fuad, Bayazid Hasan

AI Summary

The authors introduce Lipi-Ghor-882, a new 882-hour multi-speaker Bengali dataset for ASR and speaker diarization, addressing the scarcity of resources for long-form Bengali speech processing. For ASR, they find that fine-tuning on perfectly aligned annotations combined with synthetic acoustic degradation (noise and reverberation) is the most effective approach, whereas raw data scaling is ineffective. For speaker diarization, heuristic post-processing of baseline model outputs was the primary driver of accuracy gains, while extensive model retraining yielded negligible improvements. The combined ASR and diarization pipeline achieves a Real-Time Factor (RTF) of about 0.019.
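To put the reported RTF in concrete terms, the short sketch below uses the standard definition (processing time divided by audio duration). The function names are illustrative and do not come from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = processing time / audio duration; values below 1 are faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def processing_time_for(audio_seconds: float, rtf: float) -> float:
    """Wall-clock seconds needed to process audio of a given length at a given RTF."""
    return audio_seconds * rtf


# At the reported RTF of about 0.019, one hour of audio takes roughly 68 seconds.
one_hour = 3600.0
seconds_needed = processing_time_for(one_hour, rtf=0.019)
print(f"{seconds_needed:.1f} s to process {one_hour:.0f} s of audio")
```

The actual throughput depends on hardware and batch settings, which this page does not state.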

Key Contribution

Counterintuitively, for low-resource, long-form Bengali speech processing, targeted data augmentation outperforms raw data scaling for ASR, and heuristic post-processing outperforms extensive model retraining for speaker diarization.
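This page does not say which post-processing heuristics the authors used. As a purely hypothetical illustration of the general idea, the sketch below applies two common diarization clean-up rules: merging consecutive same-speaker segments separated by a short gap, and dropping very short segments. The `Segment` type, the `postprocess` function, and both thresholds are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    start: float  # seconds
    end: float    # seconds


def postprocess(segments, max_gap=0.5, min_duration=0.3):
    """Merge consecutive same-speaker segments with small gaps, then drop short ones.

    Thresholds are illustrative defaults, not values from the paper.
    """
    merged = []
    for seg in sorted(segments, key=lambda s: s.start):
        last = merged[-1] if merged else None
        if last and last.speaker == seg.speaker and seg.start - last.end <= max_gap:
            # Same speaker resumes after a short pause: extend the previous segment.
            last.end = max(last.end, seg.end)
        else:
            # Copy so the caller's segments are never mutated.
            merged.append(Segment(seg.speaker, seg.start, seg.end))
    return [s for s in merged if s.end - s.start >= min_duration]


raw = [
    Segment("A", 0.0, 1.0),
    Segment("A", 1.2, 2.0),   # short pause, merged into the first segment
    Segment("B", 2.1, 2.2),   # 0.1 s blip, dropped
    Segment("A", 5.0, 6.0),   # long gap, kept separate
]
cleaned = postprocess(raw)
```

Only segments that are adjacent in time order are merged, so an intervening turn by another speaker keeps the two sides separate.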

Abstract

Although Automatic Speech Recognition (ASR) in Bengali has seen significant progress, processing long-duration audio and performing robust speaker diarization remain critical research gaps. To address the severe scarcity of joint ASR and diarization resources for this language, we introduce Lipi-Ghor-882, a comprehensive 882-hour multi-speaker Bengali dataset. In this paper, detailing our submission to the DL Sprint 4.0 competition, we systematically evaluate various architectures and approaches for long-form Bengali speech. For ASR, we demonstrate that raw data scaling is ineffective; instead, targeted fine-tuning utilizing perfectly aligned annotations paired with synthetic acoustic degradation (noise and reverberation) emerges as the single most effective approach. Conversely, for speaker diarization, we observed that global open-source state-of-the-art models (such as Diarizen) performed surprisingly poorly on this complex dataset. Extensive model retraining yielded negligible improvements; instead, strategic, heuristic post-processing of baseline model outputs proved to be the primary driver for increasing accuracy. Ultimately, this work outlines a highly optimized dual pipeline achieving a $\sim$0.019 Real-Time Factor (RTF), establishing a practical, empirically backed benchmark for low-resource, long-form speech processing.
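The abstract names noise and reverberation as the synthetic degradations but gives no recipe. The following is a minimal sketch of one way to apply both with NumPy, assuming white Gaussian noise at a target SNR and a toy exponentially decaying impulse response. The function names, the SNR of 10 dB, and the RT60 of 0.3 s are all assumptions for illustration, not the authors' settings.

```python
import numpy as np


def add_noise(clean, snr_db, rng):
    """Add white Gaussian noise so the result has roughly the target SNR in dB."""
    signal_power = np.mean(clean ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=clean.shape)
    return clean + noise


def synthetic_rir(sample_rate, rt60, rng):
    """Toy room impulse response: noise whose amplitude decays by 60 dB over rt60 seconds."""
    n = int(sample_rate * rt60)
    t = np.arange(n) / sample_rate
    decay = 10 ** (-3 * t / rt60)  # 60 dB in amplitude is a factor of 10^-3
    rir = rng.normal(size=n) * decay
    rir[0] = 1.0                   # direct path
    return rir / np.sqrt(np.sum(rir ** 2))  # normalize to unit energy


def add_reverb(clean, rir):
    """Convolve with the impulse response and trim to the input length."""
    return np.convolve(clean, rir, mode="full")[: len(clean)]


def degrade(clean, sample_rate, snr_db=10.0, rt60=0.3, seed=0):
    """Apply reverberation first, then additive noise."""
    rng = np.random.default_rng(seed)
    reverberant = add_reverb(clean, synthetic_rir(sample_rate, rt60, rng))
    return add_noise(reverberant, snr_db, rng)


# Usage on a one-second synthetic tone standing in for a speech clip.
sr = 16000
tone = 0.1 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)
degraded = degrade(tone, sr)
```

In practice, recorded noise and measured room impulse responses are often used in place of these synthetic stand-ins, and the transcript stays unchanged because the degradation does not alter timing.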

Data Curation & Synthetic Data, Natural Language Processing, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 8

Year: 2026

Venue: N/A
