The paper introduces Nwāchā Munā, a 5.39-hour manually transcribed Devanagari speech corpus for Nepal Bhasha, an under-resourced language. It establishes the first benchmark for Nepal Bhasha ASR using script-preserving acoustic modeling and investigates cross-lingual transfer learning from Nepali. Fine-tuning a Nepali Conformer model with data augmentation lowers the CER from a 52.54% zero-shot baseline to 17.59%, matching the multilingual Whisper-Small model with significantly fewer parameters. The result indicates that proximal transfer learning is effective in this ultra-low-resource setting.
Forget massive multilingual models: fine-tuning a model pretrained on a related language (Nepali) with just 5.39 hours of Nepal Bhasha speech cuts the character error rate of an endangered language from 52.54% to 17.59%, rivaling the performance of Whisper-Small.
Nepal Bhasha (Newari), an endangered language of the Kathmandu Valley, remains digitally marginalized due to the severe scarcity of annotated speech resources. In this work, we introduce Nwāchā Munā, a newly curated 5.39-hour manually transcribed Devanagari speech corpus for Nepal Bhasha, and establish the first benchmark using script-preserving acoustic modeling. We investigate whether proximal cross-lingual transfer from a geographically and linguistically adjacent language (Nepali) can rival large-scale multilingual pretraining in an ultra-low-resource Automatic Speech Recognition (ASR) setting. Fine-tuning a Nepali Conformer model reduces the Character Error Rate (CER) from a 52.54% zero-shot baseline to 17.59% with data augmentation, effectively matching the performance of the multilingual Whisper-Small model despite utilizing significantly fewer parameters. Our findings demonstrate that proximal transfer within South Asian language clusters serves as a computationally efficient alternative to massive multilingual models. We openly release the dataset and benchmarks to digitally enable the Newari community and foster further research in Nepal Bhasha.
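The abstract reports results as Character Error Rate (CER) without spelling out the computation. The sketch below uses the standard definition, edit distance between reference and hypothesis divided by reference length, and should be read as an illustration rather than the authors' evaluation code. It assumes characters are counted as Unicode code points; whether the paper counts code points or Devanagari grapheme clusters is not stated in the abstract. The function names `levenshtein`, `cer`, and `relative_reduction` are hypothetical.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn `hyp` into `ref`, computed over code points."""
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # cost of deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance divided by reference length.
    Assumption: one 'character' is one Unicode code point."""
    if not ref:
        raise ValueError("reference transcript must be non-empty")
    return levenshtein(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed by the improved system."""
    return (baseline - improved) / baseline


# CER figures taken from the abstract (in percent).
zero_shot_cer = 52.54
fine_tuned_cer = 17.59
print(f"relative CER reduction: {relative_reduction(zero_shot_cer, fine_tuned_cer):.1%}")
```

With the reported numbers, the drop from 52.54% to 17.59% is an absolute reduction of 34.95 points, or roughly two thirds of the zero-shot error. Under code-point counting, a dropped vowel sign in a Devanagari word counts as one deletion, so scores may differ from a grapheme-cluster-based CER.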