The authors introduce Bengali-Loop, two new community benchmarks for long-form Bangla speech processing: a 158.6-hour ASR corpus and a 22-hour speaker diarization corpus, both derived from YouTube content. They created the ASR corpus using a subtitle-extraction pipeline with human verification and manually annotated the speaker diarization corpus. Baseline results using Tugstugi for ASR (34.07% WER) and pyannote.audio for diarization (40.08% DER) are provided, along with standardized evaluation protocols.
Bengali is spoken by millions yet remains underserved in speech technology. Bengali-Loop, a new long-form ASR and diarization benchmark, targets realistic, multi-speaker Bangla content.
Bengali (Bangla) remains under-resourced in long-form speech technology despite its wide use. We present Bengali-Loop, two community benchmarks to address this gap: (1) a long-form ASR corpus of 191 recordings (158.6 hours, 792k words) from 11 YouTube channels, collected via a reproducible subtitle-extraction pipeline and human-in-the-loop transcript verification; and (2) a speaker diarization corpus of 24 recordings (22 hours, 5,744 annotated segments) with fully manual speaker-turn labels in CSV format. Both benchmarks target realistic multi-speaker, long-duration content (e.g., Bangla drama/natok). We establish baselines (Tugstugi: 34.07% WER; pyannote.audio: 40.08% DER) and provide standardized evaluation protocols (WER/CER, DER), annotation rules, and data formats to support reproducible benchmarking and future model development for Bangla long-form ASR and diarization.
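The WER and CER metrics named in the abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization for words, Unicode code points as characters, and spaces excluded from CER; the benchmark's own normalization and scoring rules are not given here and may differ. The function names `edit_distance`, `wer`, and `cer` are illustrative, not part of any released tooling.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Counts the minimum number of substitutions, insertions, and
    deletions needed to turn `ref` into `hyp`.
    """
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference items
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()  # assumption: whitespace tokenization
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate over Unicode code points, ignoring whitespace."""
    # Assumption: code points, not grapheme clusters; whitespace is dropped.
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    reference = "আমি ভাত খাই"
    hypothesis = "আমি ভাত খাই না"
    print(f"WER: {wer(reference, hypothesis):.2%}")
    print(f"CER: {cer(reference, hypothesis):.2%}")
```

Because Bangla script combines base consonants with vowel signs and conjuncts, counting code points rather than grapheme clusters changes CER values, so the choice should match whatever the benchmark's protocol specifies.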