RWTH | Feb 16, 2026 | arXiv:2602.14686

Disentangling Pitch and Creak for Speaker Identity Preservation in Speech Synthesis

Frederik Rautenberg, Jana Wiechmann, Petra Wagner, Reinhold Haeb-Umbach

AI Summary

This paper addresses the challenge of modifying creak in speech synthesis while preserving speaker identity, which is crucial for controllable and realistic voice manipulation. The authors disentangle pitch from creak by augmenting the training data of a speech synthesis system using a speaker manipulation block based on conditional continuous normalizing flows (cCNF). Experiments show greatly improved speaker verification performance across a range of creak manipulation strengths, indicating that speaker identity is preserved.

Key Contribution

A method for modifying creak in synthesized speech while preserving speaker identity, achieved by disentangling pitch from creak through augmentation of the synthesis training data.

Abstract

We introduce a system capable of faithfully modifying the perceptual voice quality of creak while preserving the speaker's perceived identity. While it is well known that high creak probability is typically correlated with low pitch, it is important to note that this is a property observed on a population of speakers but does not necessarily hold across all situations. Disentanglement of pitch from creak is achieved by augmentation of the training dataset of a speech synthesis system with a speaker manipulation block based on conditional continuous normalizing flow. The experiments show greatly improved speaker verification performance over a range of creak manipulation strengths.

Data Curation & Synthetic Data, Natural Language Processing, Speech & Audio

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
