Mar 2, 2026 | arXiv:2603.01622

More Data, Fewer Diacritics: Scaling Arabic TTS

AI Summary

The authors address the scarcity of Arabic TTS training data by building a pipeline that automatically collects and processes about 4,000 hours of Arabic speech using voice activity detection (VAD), automatic speech recognition (ASR), automatic diacritization, and noise filtering. They trained voice-cloning TTS models on varying amounts of this data (100, 1,000, and 4,000 hours), with and without diacritization. The results show that models trained on diacritized data are generally better, but that larger amounts of training data compensate for missing diacritics to a significant degree.

Key Contribution

Forget perfect diacritization: scaling data volume can overcome its absence in Arabic TTS, unlocking new possibilities for low-resource language synthesis.

Abstract

Arabic Text-to-Speech (TTS) research has been hindered by the limited availability of both public training data and accurate Arabic diacritization models. In this paper, we address this limitation by exploring Arabic TTS training on large, automatically annotated data. Namely, we built a robust pipeline for collecting Arabic recordings and processing them automatically using voice activity detection, speech recognition, automatic diacritization, and noise filtering, resulting in around 4,000 hours of Arabic TTS training data. We then trained several robust TTS models with voice cloning using varying amounts of data, namely 100, 1,000, and 4,000 hours, with and without diacritization. We show that although models trained on diacritized data are generally better, larger amounts of training data compensate for the lack of diacritics to a significant degree. We plan to release a public Arabic TTS model that works without the need for diacritization.
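
To make the "with and without diacritization" comparison concrete, the following is a minimal sketch of how an undiacritized copy of a transcript could be derived from a diacritized one. It is not the authors' code: the function name `strip_diacritics` and the exact set of code points removed are assumptions made for illustration (the common Arabic short-vowel and related marks, U+064B to U+0652, plus the superscript alef, U+0670).

```python
import re

# Assumed set of marks to remove (an illustrative choice, not from the paper):
# tanween (U+064B-U+064D), fatha/damma/kasra (U+064E-U+0650),
# shadda (U+0651), sukun (U+0652), and superscript alef (U+0670).
_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return `text` with the Arabic diacritic marks listed above removed."""
    return _DIACRITICS.sub("", text)


if __name__ == "__main__":
    # "kataba" written with a fatha after each consonant.
    diacritized = "\u0643\u064E\u062A\u064E\u0628\u064E"
    print(strip_diacritics(diacritized))
```

Under this assumption, the same audio could be paired with either the original or the stripped transcript to build the two training conditions described in the abstract.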

Data Curation & Synthetic Data | Natural Language Processing | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
