The authors address the scarcity of Arabic TTS training data by building a pipeline that automatically collects and processes about 4,000 hours of Arabic speech using voice activity detection (VAD), automatic speech recognition (ASR), automatic diacritization, and noise filtering. They trained voice-cloning TTS models on varying amounts of this data (100, 1,000, and 4,000 hours), with and without diacritization. The results show that although diacritized data improves performance, scaling up the training data compensates for missing diacritics to a significant degree.
Forget perfect diacritization: scaling data volume can largely compensate for its absence in Arabic TTS, opening new possibilities for speech synthesis in languages with scarce training data.
Arabic Text-to-Speech (TTS) research has been hindered by the limited availability of both public training data and accurate Arabic diacritization models. In this paper, we address these limitations by exploring Arabic TTS training on large automatically annotated data. Specifically, we built a robust pipeline for collecting Arabic recordings and processing them automatically using voice activity detection, speech recognition, automatic diacritization, and noise filtering, resulting in around 4,000 hours of Arabic TTS training data. We then trained several robust TTS models with voice cloning using varying amounts of data, namely 100, 1,000, and 4,000 hours, with and without diacritization. We show that although models trained on diacritized data are generally better, larger amounts of training data compensate for the lack of diacritics to a significant degree. We plan to release a public Arabic TTS model that works without the need for diacritization.
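To make the "with and without diacritization" comparison concrete, the following minimal sketch shows one way an undiacritized copy of a transcript could be derived from a diacritized one. It is an illustration only, not the authors' code: the function name `strip_diacritics` and the exact set of marks removed (the Unicode range U+064B to U+0652 plus the superscript alef U+0670) are assumptions made for this example.

```python
import re

# Assumption for this sketch: treat the tanwin, short-vowel, shadda, and sukun
# marks (U+064B..U+0652) plus the superscript alef (U+0670) as "diacritics".
# The paper does not specify which marks its pipeline adds or removes.
_ARABIC_DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def strip_diacritics(text: str) -> str:
    """Return `text` with the Arabic diacritic marks listed above removed."""
    return _ARABIC_DIACRITICS.sub("", text)


# "kataba" written with a fatha (U+064E) after each of its three consonants.
diacritized = "\u0643\u064E\u062A\u064E\u0628\u064E"
undiacritized = strip_diacritics(diacritized)

print(len(diacritized), len(undiacritized))  # 6 code points vs. 3
print(undiacritized)
```

The undiacritized string keeps only the three base letters, so a model trained on it has to infer the short vowels from context, which is the ability the abstract reports improving as the amount of training data grows.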