The paper introduces the Thiomi Dataset, a large multimodal corpus for ten low-resource African languages, collected via a dedicated community platform and a multi-tier quality assurance pipeline. The authors establish baseline ASR, MT, and TTS models for these languages, demonstrating the dataset's utility. Notably, their ASR system achieves a 3.24% WER on Swahili, significantly outperforming prior state-of-the-art results.
Thiomi slashes Swahili ASR error rates by 61% and unlocks nine more African languages for multimodal AI, proving community-driven data collection can leapfrog existing benchmarks.
We present the Thiomi Dataset, a large-scale multimodal corpus spanning ten African languages across four language families: Swahili, Kikuyu, Kamba, Kimeru, Luo, Maasai, Kipsigis, and Somali (East Africa); Wolof (West Africa); and Fulani (West/Central Africa). The dataset contains over 601,000 approved sentence-level text annotations and over 385,000 audio recordings, collected through a dedicated community data collection platform involving over 100 contributors. The Thiomi platform collected data for nine of the ten languages; Swahili was covered with existing Common Voice recordings. A multi-tier quality assurance pipeline achieves 86-100% text approval rates for the six primary languages. To validate the dataset's utility, we train and evaluate ASR, MT, and TTS models, establishing baselines across all ten languages. Our best ASR system achieves 3.24% WER on Swahili (Common Voice), a 5.1 percentage point absolute (61% relative) reduction over the prior academic state of the art of 8.3%, and 4.3% WER on Somali. The dataset will be published on HuggingFace. We describe the collection platform, quality assurance workflows, and baseline experiments, and discuss implications for African language technology infrastructure.