Jun 8, 2026 | arXiv:2606.09050

MeanVC 2: Robust Low-Latency Streaming Zero-Shot Voice Conversion

Guobin Ma, Yuxuan Xia, Yuepeng Jiang, Dake Guo, Hanke Xie, Jingbin Hu, Yanbo Wang, Lei Xie, Pengcheng Zhu

AI Summary

MeanVC 2 improves on its predecessor, MeanVC, for streaming zero-shot voice conversion by introducing future-receptive chunking (FRC) and a universal timbre token encoder. FRC schedules past and future receptive fields across the diffusion transformer decoder layers, enabling stable conversion with a 40 ms chunk size, while the new timbre encoder improves robustness to low-quality reference audio. Experiments show that MeanVC 2 outperforms MeanVC in conversion quality and reduces latency from 211 ms to 110 ms, making it better suited to real-time applications.

Key Contribution

MeanVC 2 roughly halves voice conversion latency (from 211 ms to 110 ms) while improving robustness to low-quality reference audio.

Abstract

Streaming zero-shot voice conversion (VC) has become increasingly popular due to its potential for real-time applications. The recently proposed MeanVC achieves lightweight streaming zero-shot VC, but it has several limitations: its chunk-wise autoregressive denoising doubles the effective training sequence length, conversion quality degrades under small-chunk settings, and its timbre encoder directly relies on reference mel-spectrograms, making it sensitive to reference audio quality. To address these limitations, we propose MeanVC 2. We introduce future-receptive chunking (FRC), which explicitly schedules past and future receptive fields across diffusion transformer decoder layers and removes clean-chunk teacher forcing. By incorporating bounded future context, FRC enables stable conversion with a 40 ms chunk size. We further introduce a universal timbre token encoder, which constructs a timbre representation from a global speaker embedding and retrieves fine-grained timbre cues via cross-attention, improving robustness to low-quality references and enhancing zero-shot speaker similarity. Experimental results show that MeanVC 2 significantly outperforms MeanVC, while reducing latency from 211 ms to 110 ms. Audio samples are publicly available. The source code will be publicly released.
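
As a rough illustration of what "bounded past and future receptive fields" could look like at the attention level, the sketch below builds a chunk-wise attention mask in plain Python. It is not the authors' implementation: the function name `chunk_attention_mask`, the parameters `past_chunks` and `future_chunks`, and the per-layer schedule are all hypothetical, since the abstract does not specify the actual FRC schedule.

```python
def chunk_attention_mask(num_frames, chunk_size, past_chunks, future_chunks):
    """Return mask[i][j] == True when query frame i may attend to key frame j.

    Frames are grouped into fixed-size chunks. A query frame sees its own
    chunk, up to `past_chunks` earlier chunks, and up to `future_chunks`
    later chunks. All names here are illustrative assumptions.
    """
    chunk_ids = [i // chunk_size for i in range(num_frames)]
    return [
        [-past_chunks <= key_chunk - query_chunk <= future_chunks
         for key_chunk in chunk_ids]
        for query_chunk in chunk_ids
    ]


# Hypothetical per-layer schedule of (past_chunks, future_chunks).
# Only the first layer looks ahead, so the lookahead stays bounded.
layer_schedule = [(2, 1), (2, 0), (2, 0)]

# One mask per decoder layer, for 8 frames grouped into chunks of 2 frames.
masks = [chunk_attention_mask(8, 2, p, f) for p, f in layer_schedule]

# Receptive fields of stacked layers add up, so the total lookahead of the
# stack is the sum of the per-layer future chunks.
total_future_chunks = sum(f for _, f in layer_schedule)
```

In this toy schedule the stack as a whole looks ahead by one chunk, because only one layer has a non-zero future context. Placing the lookahead in a few layers rather than all of them is one way a per-layer schedule can keep the added algorithmic delay bounded.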

Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
