Apr 9, 2026 · arXiv:2604.08448

AfriVoices-KE: A Multilingual Speech Dataset for Kenyan Languages

Lilian Wanzare, Cynthia Amol, Ezekiel Maina, Nelson Odhiambo, Hope Kerubo, Leila Misula, Vivian Oloo, Rennish Mboya, Edwin Onkoba, Edward Ombui, Joseph K. Muguro, Ciira wa Maina, Andrew Kipkebut, Alfred Omondi Otom, Ian Ndung'u Kang'ethe, Angela Wambui Kanyi, Brian Gichana Omwenga

AI Summary

AfriVoices-KE introduces a multilingual speech dataset of approximately 3,000 hours covering five Kenyan languages (Dholuo, Kikuyu, Kalenjin, Maasai, and Somali), expanding the resources available for African-language speech technology. The dataset includes both scripted and spontaneous speech, recorded through a custom mobile application and checked through multi-layer quality control. It addresses a gap in representation and supports the development of more inclusive speech recognition and text-to-speech systems for these languages.

Key Contribution

A large-scale multilingual speech dataset (approximately 3,000 hours) for five Kenyan languages, supporting speech technology for previously under-represented linguistic communities.

Abstract

AfriVoices-KE is a large-scale multilingual speech dataset comprising approximately 3,000 hours of audio across five Kenyan languages: Dholuo, Kikuyu, Kalenjin, Maasai, and Somali. The dataset includes 750 hours of scripted speech and 2,250 hours of spontaneous speech, collected from 4,777 native speakers across diverse regions and demographics. This work addresses the critical underrepresentation of African languages in speech technology by providing a high-quality, linguistically diverse resource. Data collection followed a dual methodology: scripted recordings drew from compiled text corpora, translations, and domain-specific generated sentences spanning eleven domains relevant to the Kenyan context, while unscripted speech was elicited through textual and image prompts to capture natural linguistic variation and dialectal nuances. A customized mobile application enabled contributors to record using smartphones. Quality assurance operated at multiple layers, encompassing automated signal-to-noise ratio validation prior to recording and human review for content accuracy. Though the project encountered challenges common to low-resource settings, including unreliable infrastructure, device compatibility issues, and community trust barriers, these were mitigated through local mobilizers, stakeholder partnerships, and adaptive training protocols. AfriVoices-KE provides a foundational resource for developing inclusive automatic speech recognition and text-to-speech systems, while advancing the digital preservation of Kenya's linguistic heritage.
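The abstract mentions automated signal-to-noise ratio validation but does not say how it is computed or what threshold is applied. As a rough illustration only, the sketch below estimates SNR from a short test clip by comparing the loudest and quietest frames, then applies a pass/fail gate. The function names, the frame length, the 10% noise-floor heuristic, and the 20 dB threshold are all assumptions made for this example and are not taken from the paper.

```python
import math


def estimate_snr_db(samples, frame_len=400, noise_fraction=0.1):
    """Estimate SNR in dB from per-frame energies.

    Assumption (not from the paper): the quietest `noise_fraction` of
    frames approximate the noise floor, and the loudest same fraction
    approximate the speech level.
    """
    # Split the clip into non-overlapping frames, dropping any short tail.
    frames = [
        samples[i:i + frame_len]
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]
    if not frames:
        raise ValueError("audio is shorter than one frame")

    # Mean energy per frame, sorted from quietest to loudest.
    energies = sorted(sum(x * x for x in frame) / frame_len for frame in frames)

    k = max(1, int(len(energies) * noise_fraction))
    noise_energy = sum(energies[:k]) / k
    signal_energy = sum(energies[-k:]) / k

    eps = 1e-12  # avoids division by zero on digital silence
    return 10.0 * math.log10((signal_energy + eps) / (noise_energy + eps))


def passes_snr_gate(samples, min_snr_db=20.0):
    """Return True if the clip meets the hypothetical SNR threshold."""
    return estimate_snr_db(samples) >= min_snr_db
```

A clip with a quiet background followed by clear speech would pass this gate, while a clip whose energy is roughly constant throughout (for example, steady background noise) yields an estimate near 0 dB and would be rejected. A production check would more likely use a voice activity detector than a fixed percentile heuristic.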

Data Curation & Synthetic Data · Natural Language Processing · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 15

Year: 2026

Venue: N/A
