This paper addresses the challenge of modifying creak in speech synthesis while preserving speaker identity, which is crucial for controllable and realistic voice manipulation. The authors disentangle pitch from creak by augmenting the training data of a speech synthesis system with a speaker manipulation block based on conditional continuous normalizing flows (cCNF). Experiments show greatly improved speaker verification performance across a range of creak manipulation strengths, indicating that speaker identity is preserved.
Modifying voice creak without losing speaker identity is now possible, thanks to a new disentanglement method.
We introduce a system capable of faithfully modifying the perceptual voice quality of creak while preserving the speaker's perceived identity. Although high creak probability is typically correlated with low pitch, this correlation is a property observed across a population of speakers and does not necessarily hold in every situation. Pitch is disentangled from creak by augmenting the training dataset of a speech synthesis system with a speaker manipulation block based on conditional continuous normalizing flows. The experiments show greatly improved speaker verification performance over a range of creak manipulation strengths.