Mar 4, 2026 · arXiv:2603.04219

ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis

Youngwon Choi, Jinwook Oh, Jinwoo Oh, Hwayeon Kim, Hyeonyu Kim

AI Summary

This paper introduces ZeSTA, a domain-conditioned training framework for augmenting limited real speech data with zero-shot TTS outputs for personalized speech synthesis. ZeSTA uses a domain embedding to differentiate real and synthetic speech, coupled with real-data oversampling to stabilize adaptation. Experiments on LibriTTS and an in-house dataset show that ZeSTA improves speaker similarity compared to naive synthetic augmentation, while maintaining intelligibility and perceptual quality.

Key Contribution

ZeSTA's domain-conditioned training lets personalized TTS pipelines use zero-shot synthetic speech as augmentation when real data is scarce, improving speaker similarity over naive synthetic mixing while preserving intelligibility and perceptual quality.

Abstract

We investigate the use of zero-shot text-to-speech (ZS-TTS) as a data augmentation source for low-resource personalized speech synthesis. While synthetic augmentation can provide linguistically rich and phonetically diverse speech, naively mixing large amounts of synthetic speech with limited real recordings often leads to speaker similarity degradation during fine-tuning. To address this issue, we propose ZeSTA, a simple domain-conditioned training framework that distinguishes real and synthetic speech via a lightweight domain embedding, combined with real-data oversampling to stabilize adaptation under extremely limited target data, without modifying the base architecture. Experiments on LibriTTS and an in-house dataset with two ZS-TTS sources demonstrate that our approach improves speaker similarity over naive synthetic augmentation while preserving intelligibility and perceptual quality.
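The two ingredients named in the abstract, real-data oversampling and a domain embedding that tags each utterance as real or synthetic, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the authors' implementation: the names `build_training_pool`, `make_domain_table`, and `condition`, the `real_fraction` parameter, oversampling by plain repetition, and adding the domain vector to a speaker-conditioning vector are all hypothetical choices. The abstract does not state how the embedding is injected or which oversampling ratio is used.

```python
import math
import random

# Hypothetical domain IDs; the abstract only says real and synthetic speech are distinguished.
REAL, SYNTHETIC = 0, 1


def build_training_pool(real_utts, synth_utts, real_fraction=0.5):
    """Repeat real utterances so they make up at least `real_fraction` of the pool.

    Oversampling by repetition is an assumption; the paper may use a sampler instead.
    """
    if not real_utts:
        raise ValueError("need at least one real utterance")
    if not 0.0 < real_fraction < 1.0:
        raise ValueError("real_fraction must be strictly between 0 and 1")
    # Solve r / (r + s) >= f for r, the number of real items required.
    needed = math.ceil(real_fraction * len(synth_utts) / (1.0 - real_fraction))
    target_real = max(len(real_utts), needed)
    repeats, remainder = divmod(target_real, len(real_utts))
    real_pool = list(real_utts) * repeats + list(real_utts[:remainder])
    # Each training item carries its domain ID alongside the utterance.
    return [(u, REAL) for u in real_pool] + [(u, SYNTHETIC) for u in synth_utts]


def make_domain_table(dim, seed=0):
    """Create one small vector per domain (a stand-in for a learned embedding)."""
    rng = random.Random(seed)
    return {d: [rng.gauss(0.0, 0.02) for _ in range(dim)] for d in (REAL, SYNTHETIC)}


def condition(speaker_vec, domain_id, table):
    """Add the domain vector to the conditioning vector (assumed injection point)."""
    return [s + e for s, e in zip(speaker_vec, table[domain_id])]


if __name__ == "__main__":
    real = ["real_001", "real_002", "real_003"]
    synth = [f"synth_{i:03d}" for i in range(10)]
    pool = build_training_pool(real, synth, real_fraction=0.5)
    n_real = sum(1 for _, d in pool if d == REAL)
    print(f"{n_real} real items out of {len(pool)} total")
    table = make_domain_table(dim=4)
    print(condition([1.0, 1.0, 1.0, 1.0], REAL, table))
```

In a setup like this, one would presumably condition on the real domain ID at inference time so that the model generates speech in the style of the real recordings. That is an inference from the abstract, not something it states.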

Data Curation & Synthetic Data · Natural Language Processing · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 38

Year: 2026

Venue: N/A
