DAMO · USTC · Jun 1, 2026 · arXiv:2606.01677

UniVocal: Unified Speech-Singing Code-Switching Synthesis

Yufei Shi, Qian Chen, Zhen-Hua Ling, Yang Ai

AI Summary

UniVocal is a framework for Speech-Singing Code-Switching (SCS) Synthesis that transitions between speech and singing based on text context alone, with no explicit switching-control tags. A two-stage curriculum learning strategy progressively trains a competitive TTS system to acquire this capability, and a scalable pipeline addresses data scarcity by synthesizing diverse, natural code-switching data. Experiments show that UniVocal achieves state-of-the-art performance on the new SCSBench benchmark while remaining competitive on regular speech and singing tasks.

Key Contribution

Transitions between speech and singing are inferred purely from text context, without switching-control tags, with state-of-the-art results on SCSBench.

Abstract

We propose UniVocal, a unified framework that pioneers Speech-Singing Code-Switching (SCS) Synthesis, a task in which transitions between speech and singing are driven autonomously by textual semantics, akin to seamless human language blending. Unlike single-mode generation or systems that rely on switching-control tags, UniVocal infers vocal modes implicitly and solely from text context. To achieve this, we employ a data-efficient two-stage curriculum learning strategy that progressively trains a competitive TTS system to acquire the desired SCS capability. To address data scarcity, we introduce a scalable pipeline for synthesizing diverse code-switching data that is both semantically and acoustically natural, alongside a new multi-scenario benchmark, SCSBench. To address the limitations of semantic tokenizers in capturing acoustic details, we also introduce a refined cent token and Chain-of-Thought (CoT) generation that plans prosody before content generation, enhancing both empathetic speech generation and singing melody. Experimental results demonstrate that UniVocal achieves state-of-the-art performance on SCSBench while maintaining competitive performance on regular speech and singing tasks. Audio samples are available at https://project-univocal-demo.github.io/demo/. The code and dataset are released at https://github.com/FunAudioLLM/FunResearch/tree/main/UniVocal.
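
For readers unfamiliar with the unit behind the "cent token", the sketch below shows the standard musical definition of a cent (1200 cents per octave) and one way a pitch value could be quantized into a discrete token. The abstract does not specify how UniVocal defines or refines its cent token, so the 440 Hz reference, the 50-cent bin width, and the function names `hz_to_cents` and `cents_to_token` are all illustrative assumptions, not the authors' method.

```python
import math

A4_HZ = 440.0  # reference pitch; an assumption for this sketch


def hz_to_cents(f0_hz: float, ref_hz: float = A4_HZ) -> float:
    """Return the interval from ref_hz to f0_hz in cents (1200 per octave)."""
    if f0_hz <= 0 or ref_hz <= 0:
        raise ValueError("frequencies must be positive")
    return 1200.0 * math.log2(f0_hz / ref_hz)


def cents_to_token(cents: float, bin_width: float = 50.0) -> int:
    """Quantize a cent value to an integer bin index.

    Hypothetical tokenization: the bin width is illustrative and is not
    taken from the paper.
    """
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    return round(cents / bin_width)


if __name__ == "__main__":
    # One octave above the reference is 1200 cents by definition.
    for f0 in (220.0, 440.0, 880.0):
        cents = hz_to_cents(f0)
        print(f0, cents, cents_to_token(cents))
```

Because cents are logarithmic in frequency, equal bin widths correspond to equal musical intervals regardless of register, which is one reason a cent-based representation suits melody better than raw frequency in Hz.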

Multimodal Models · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
