The paper introduces Voice of India, a new closed-source ASR benchmark comprising 536 hours of unscripted telephonic speech across 15 major Indian languages and 139 regional clusters. The dataset addresses limitations of existing Indic ASR benchmarks by using real-world conversational speech and accounting for spelling variations common in Indian languages. Analysis of ASR performance reveals significant geographic disparities and sheds light on the impact of factors like audio quality and speaking rate on ASR accuracy.
Current ASR systems stumble significantly when faced with the nuances of real-world Indian speech, as revealed by a new benchmark exposing geographic performance disparities and the impact of audio quality, speaking rate, and device type.
Existing Indic ASR benchmarks often use scripted, clean speech and leaderboard-driven evaluation that encourages dataset-specific overfitting. In addition, strict single-reference WER penalizes natural spelling variation in Indian languages, including non-standardized spellings of code-mixed, English-origin words. To address these limitations, we introduce Voice of India, a closed-source benchmark built from unscripted telephonic conversations covering 15 major Indian languages across 139 regional clusters. The dataset contains 306,230 utterances totaling 536 hours of speech from 36,691 speakers, with transcripts that account for spelling variations. We also analyze performance at the district level, revealing geographic disparities in ASR accuracy. Finally, we provide a detailed analysis across factors such as audio quality, speaking rate, gender, and device type, highlighting where current ASR systems struggle and offering insights for improving real-world Indic ASR systems.
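To make the point about single-reference WER concrete, here is a minimal sketch of one way a word error rate could tolerate spelling variants: each reference position carries a set of accepted spellings, and a hypothesis word counts as correct if it matches any of them. This is an illustration under stated assumptions, not the scoring procedure used by Voice of India, which the abstract does not specify. The function name `variant_aware_wer` and the romanized example words are hypothetical.

```python
from typing import Sequence, Set


def variant_aware_wer(reference: Sequence[Set[str]], hypothesis: Sequence[str]) -> float:
    """Word error rate where each reference position lists its accepted spellings.

    Hypothetical helper for illustration; not the benchmark's actual metric.
    """
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words (standard Levenshtein DP).
    prev = list(range(m + 1))
    for i in range(1, n + 1):
        curr = [i] + [0] * m
        for j in range(1, m + 1):
            # A hypothesis word is correct if it equals ANY accepted variant.
            is_match = hypothesis[j - 1] in reference[i - 1]
            curr[j] = min(
                prev[j] + 1,                         # deletion
                curr[j - 1] + 1,                     # insertion
                prev[j - 1] + (0 if is_match else 1),  # match or substitution
            )
        prev = curr
    return prev[m] / n


# Made-up romanized tokens: two spellings of the same English-origin words.
strict_ref = [{"mera"}, {"mobile"}, {"recharge"}, {"karo"}]
lenient_ref = [{"mera"}, {"mobile", "mobail"}, {"recharge", "richarj"}, {"karo"}]
hypothesis = ["mera", "mobail", "richarj", "karo"]

strict_wer = variant_aware_wer(strict_ref, hypothesis)
lenient_wer = variant_aware_wer(lenient_ref, hypothesis)
```

With a single accepted spelling per word, the two variant spellings in the hypothesis are scored as substitutions; once the variants are listed in the reference, the same hypothesis is scored as fully correct. Ordinary single-reference WER is the special case in which every set contains exactly one spelling.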