The paper introduces Emotion-Aware Prefix, a method for explicit emotion control in zero-shot voice conversion, built on a two-stage voice conversion backbone. By jointly controlling sequence modulation and acoustic realization with the prefix, the method doubles the baseline emotion conversion accuracy from 42.40% to 85.50%, while maintaining linguistic integrity, speech quality, and speaker identity, demonstrating its effectiveness and generalizability.
Double the emotion conversion accuracy in voice conversion models with a simple prefix that jointly controls sequence modulation and acoustic realization.
Recent advances in zero-shot voice conversion have shown potential for emotion control, yet performance remains suboptimal or inconsistent due to limited expressive capacity. We propose Emotion-Aware Prefix for explicit emotion control in a two-stage voice conversion backbone. We significantly improve emotion conversion performance, doubling the baseline Emotion Conversion Accuracy (ECA) from 42.40% to 85.50%, while maintaining linguistic integrity and speech quality and without compromising speaker identity. Our ablation study suggests that joint control of both sequence modulation and acoustic realization is essential for synthesizing distinct emotions. Furthermore, comparative analysis verifies the generalizability of the proposed method and provides insights into the role of acoustic decoupling in maintaining speaker identity.
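To make the prefix idea concrete, the sketch below illustrates one plausible reading of the abstract: a learned per-emotion prefix prepended to the inputs of both stages of a generic two-stage backbone, so a single embedding jointly steers sequence modulation (stage 1) and acoustic realization (stage 2). All module names, dimensions, and the transformer stages are assumptions for illustration, not the paper's actual architecture.

```python
# Minimal PyTorch sketch of prefix-based emotion conditioning.
# Assumptions (not from the paper): a hypothetical two-stage backbone with
# transformer stages, a learned (num_emotions, prefix_len, dim) prefix table,
# and an 80-bin mel output head.
import torch
import torch.nn as nn


class EmotionPrefix(nn.Module):
    """Learned prefix: one block of `prefix_len` vectors per emotion."""

    def __init__(self, num_emotions: int, prefix_len: int, dim: int):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(num_emotions, prefix_len, dim) * 0.02)

    def forward(self, emotion_id: torch.Tensor) -> torch.Tensor:
        # emotion_id: (batch,) -> (batch, prefix_len, dim)
        return self.prefix[emotion_id]


class TwoStageVC(nn.Module):
    """Hypothetical two-stage backbone: stage 1 modulates the content sequence,
    stage 2 realizes acoustic features; the same emotion prefix is prepended to
    the input of both stages (the "joint control" described in the abstract)."""

    def __init__(self, dim: int = 256, num_emotions: int = 5, prefix_len: int = 8):
        super().__init__()
        self.emotion_prefix = EmotionPrefix(num_emotions, prefix_len, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.stage1 = nn.TransformerEncoder(layer, num_layers=2)  # sequence modulation
        self.stage2 = nn.TransformerEncoder(layer, num_layers=2)  # acoustic realization
        self.to_acoustic = nn.Linear(dim, 80)  # e.g. 80-bin mel frames

    def forward(self, content: torch.Tensor, emotion_id: torch.Tensor) -> torch.Tensor:
        prefix = self.emotion_prefix(emotion_id)  # (B, P, D)
        p = prefix.size(1)
        # Prepend the prefix to each stage's input, then drop the prefix positions.
        h = self.stage1(torch.cat([prefix, content], dim=1))[:, p:]
        h = self.stage2(torch.cat([prefix, h], dim=1))[:, p:]
        return self.to_acoustic(h)  # (B, T, 80)


model = TwoStageVC()
content = torch.randn(2, 100, 256)           # dummy content features
mel = model(content, torch.tensor([1, 3]))   # request two different target emotions
print(mel.shape)                             # torch.Size([2, 100, 80])
```

In this reading, removing the prefix from either stage would leave only sequence modulation or only acoustic realization under emotion control, which is the contrast the ablation study examines.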