Feb 15, 2026 · arXiv:2602.14291

Bengali-Loop: Community Benchmarks for Long-Form Bangla ASR and Speaker Diarization

H. M. Shadman Tabib, Istiak Ahmmed Rifti, Abdullah Muhammed Amimul Ehsan, Somik Dasgupta, Md Zim Mim Siddiqee Sowdha, Abrar Jahin Sarker, Md. Rafiul Islam Nijamy, Tanvir Hossain, Mst. Metaly Khatun, Munzer Mahmood, Rakesh Debnath, Gourab Biswas, Asif Karim, Wahid Al Azad Navid, Masnoon Muztahid, Fuad Ahmed Udoy, Shahad Shahriar Rahman, Md. Tashdiqur Rahman Shifat, Most. Sonia Khatun, Mushfiqur Rahman, Md. Miraj Hasan, Anik Saha, Mohammad Ninad Mahmud Nobo, Soumik Bhattacharjee, Tusher Bhomik, Ahmmad Nur Swapnil, Shahriar Kabir

AI Summary

The authors introduce Bengali-Loop, two new community benchmarks for long-form Bangla speech processing: a 158.6-hour ASR corpus and a 22-hour speaker diarization corpus, both derived from YouTube content. They created the ASR corpus using a subtitle-extraction pipeline with human verification and manually annotated the speaker diarization corpus. Baseline results using Tugstugi for ASR (34.07% WER) and pyannote.audio for diarization (40.08% DER) are provided, along with standardized evaluation protocols.

Key Contribution

Although Bengali is spoken by millions, its long-form speech technology remains under-resourced. Bengali-Loop addresses this with new long-form ASR and speaker diarization benchmarks built on realistic, multi-speaker Bangla content.

Abstract

Bengali (Bangla) remains under-resourced in long-form speech technology despite its wide use. We present Bengali-Loop, two community benchmarks to address this gap: (1) a long-form ASR corpus of 191 recordings (158.6 hours, 792k words) from 11 YouTube channels, collected via a reproducible subtitle-extraction pipeline and human-in-the-loop transcript verification; and (2) a speaker diarization corpus of 24 recordings (22 hours, 5,744 annotated segments) with fully manual speaker-turn labels in CSV format. Both benchmarks target realistic multi-speaker, long-duration content (e.g., Bangla drama/natok). We establish baselines (Tugstugi: 34.07% WER; pyannote.audio: 40.08% DER) and provide standardized evaluation protocols (WER/CER, DER), annotation rules, and data formats to support reproducible benchmarking and future model development for Bangla long-form ASR and diarization.
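The WER and CER metrics named in the abstract can be illustrated with a minimal sketch. This is not the authors' evaluation script: the function names are hypothetical, tokenization is plain whitespace splitting, and CER is counted over Unicode code points with spaces removed. The benchmark's own text normalization rules are not given here, so real scores may differ from what this sketch produces.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate over code points, ignoring spaces (assumption)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # One substituted word out of three reference words.
    print(wer("আমি ভাত খাই", "আমি ভাত খাব"))
    print(cer("আমি ভাত খাই", "আমি ভাত খাব"))
```

Because Bangla uses combining vowel signs and conjuncts, counting code points is only one possible definition of a "character"; a grapheme-cluster-based count would give different CER values.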

Data Curation & Synthetic Data · Eval Frameworks & Benchmarks · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
