The paper introduces Speech Generation Speaker Poisoning (SGSP), a framework for removing specific speaker identities from zero-shot TTS models to mitigate privacy risks. The authors evaluate inference-time filtering and parameter-modification baselines, measuring the trade-off between utility (WER) and privacy (AUC, FSSIM). Results show strong privacy for up to 15 forgotten speakers, but scalability limits emerge at 100 speakers due to increased identity overlap.
You can now poison a zero-shot TTS model to prevent it from generating speech for specific target speakers, but scaling this defense to a large number of speakers remains a challenge.
Zero-shot Text-to-Speech (TTS) voice cloning poses severe privacy risks, demanding the removal of specific speaker identities from trained TTS models. Conventional machine unlearning is insufficient in this context, as zero-shot TTS can dynamically reconstruct voices from just reference prompts. We formalize this task as Speech Generation Speaker Poisoning (SGSP), in which we modify trained models to prevent the generation of specific identities while preserving utility for other speakers. We evaluate inference-time filtering and parameter-modification baselines across 1, 15, and 100 forgotten speakers. Performance is assessed through the trade-off between utility (WER) and privacy, quantified using AUC and Forget Speaker Similarity (FSSIM). We achieve strong privacy for up to 15 speakers but reveal scalability limits at 100 speakers due to increased identity overlap. Our study thus introduces a novel problem and evaluation framework toward further advances in generative voice privacy.
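To make the inference-time filtering idea and the AUC privacy metric concrete, here is a minimal sketch. It is not the authors' implementation. It assumes speaker embeddings are already available as plain vectors, that filtering means refusing a reference prompt whose cosine similarity to any forgotten speaker exceeds a threshold, and that AUC measures how well a similarity score separates forgotten from retained speakers. The function names and the threshold value of 0.7 are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def max_forget_similarity(prompt_emb, forget_embs):
    """Highest similarity between the prompt and any forgotten speaker."""
    return max(cosine(prompt_emb, f) for f in forget_embs)


def should_block(prompt_emb, forget_embs, threshold=0.7):
    """Hypothetical filter: refuse synthesis if the reference prompt
    looks like a forgotten speaker. The threshold is an assumption."""
    return max_forget_similarity(prompt_emb, forget_embs) >= threshold


def auc(forget_scores, retain_scores):
    """Pairwise (Mann-Whitney) AUC: the fraction of (forget, retain)
    pairs in which the forgotten speaker scores higher; ties count half."""
    wins = 0.0
    for f in forget_scores:
        for r in retain_scores:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(forget_scores) * len(retain_scores))


# Toy 2-D embeddings, purely illustrative.
forget_embs = [[1.0, 0.0], [0.0, 1.0]]
blocked = should_block([0.9, 0.1], forget_embs)     # close to first forgotten speaker
allowed = should_block([-1.0, -1.0], forget_embs)   # far from both
separation = auc([0.9, 0.8], [0.1, 0.85])
```

With many forgotten speakers, more retained voices are likely to fall near at least one forgotten embedding, so a single threshold blocks more legitimate prompts. This is one plausible reading of the identity-overlap limit reported at 100 speakers.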