Universal Speech Content Factorization (USCF) is introduced as a linear method for extracting low-rank speech representations that suppress speaker timbre while preserving phonetic content. It extends closed-set Speech Content Factorization to an open-set setting by learning a universal speech-to-content mapping via least-squares optimization and deriving speaker-specific transformations from limited target speech. Experiments demonstrate that USCF effectively removes speaker-dependent variation, achieves competitive zero-shot voice conversion performance, and serves as a training-efficient timbre-disentangled speech feature for timbre-prompted text-to-speech models.
Achieve zero-shot voice conversion competitive with methods requiring more data or training, using a simple, invertible linear method to disentangle speech content from speaker timbre.
We propose Universal Speech Content Factorization (USCF), a simple and invertible linear method for extracting a low-rank speech representation in which speaker timbre is suppressed while phonetic content is preserved. USCF extends Speech Content Factorization, a closed-set voice conversion (VC) method, to an open-set setting by learning a universal speech-to-content mapping via least-squares optimization and deriving speaker-specific transformations from only a few seconds of target speech. We show through embedding analysis that USCF effectively removes speaker-dependent variation. As a zero-shot VC system, USCF achieves competitive intelligibility, naturalness, and speaker similarity compared to methods that require substantially more target-speaker data or additional neural training. Finally, we demonstrate that USCF features, as a training-efficient, timbre-disentangled representation, can serve as the acoustic features for training timbre-prompted text-to-speech models. Speech samples and code are publicly available.
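The two least-squares steps described above can be sketched in a few lines of numpy. This is an illustrative toy, not the paper's implementation: the feature matrix `X`, content targets `C`, and the variable names are assumptions, and real USCF would operate on actual speech features rather than synthetic data.

```python
import numpy as np

# Toy sketch of the two least-squares steps in the abstract (names illustrative):
# X: (n_frames, d) speech features pooled across many speakers;
# C: (n_frames, r) low-rank content targets with r < d.
rng = np.random.default_rng(0)
n, d, r = 2000, 64, 8
X = rng.standard_normal((n, d))
C = X @ rng.standard_normal((d, r))  # synthetic content targets for the demo

# 1) Universal speech-to-content mapping: min_W ||X W - C||_F
W, *_ = np.linalg.lstsq(X, C, rcond=None)

# 2) Speaker-specific transformation from limited target speech:
#    given a few seconds of target frames X_t with content C_t = X_t W,
#    solve min_B ||C_t B - X_t||_F to map content into the target's space.
X_t = X[:200]                        # stands in for "a few seconds" of speech
C_t = X_t @ W
B, *_ = np.linalg.lstsq(C_t, X_t, rcond=None)

# Zero-shot conversion then amounts to content extraction followed by
# re-rendering in the target speaker's space:
converted = (X @ W) @ B
print(converted.shape)
```

Because both steps are plain linear least-squares problems, no neural training is involved, which is the source of the method's training efficiency; the linearity is also what makes the mapping invertible in the low-rank content subspace.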