UniVocal is a framework for Speech-Singing Code-Switching (SCS) Synthesis that switches between speaking and singing based on text context alone, with no explicit switching-control tags. A data-efficient two-stage curriculum learning strategy progressively trains a competitive TTS system to acquire SCS capability, and a scalable pipeline addresses data scarcity by synthesizing diverse, natural code-switching data. UniVocal achieves state-of-the-art performance on the new SCSBench benchmark while remaining competitive on regular speech and singing tasks.
UniVocal drives transitions between speech and singing purely from text context and achieves state-of-the-art results in speech-singing code-switching synthesis.
We propose UniVocal, a unified framework that pioneers Speech-Singing Code-Switching (SCS) Synthesis, a task in which transitions between speech and singing are driven autonomously by textual semantics, much as human speakers blend languages seamlessly. Unlike single-mode generation or systems that rely on switching-control tags, UniVocal infers vocal modes implicitly and solely from text context. To achieve this, we employ a data-efficient two-stage curriculum learning strategy that progressively trains a competitive TTS system to acquire SCS capability. To address data scarcity, we introduce a scalable pipeline that synthesizes diverse code-switching data that is both semantically and acoustically natural, alongside a new multi-scenario benchmark, SCSBench. To address the limitations of semantic tokenizers in capturing acoustic details, we also introduce refined cent tokens and Chain-of-Thought (CoT) generation, which plans prosody before content generation and improves both empathetic speech generation and singing melody. Experimental results demonstrate that UniVocal achieves state-of-the-art performance on SCSBench while maintaining competitive performance on regular speech and singing tasks. Audio samples are available at https://project-univocal-demo.github.io/demo/. The code and dataset are released at https://github.com/FunAudioLLM/FunResearch/tree/main/UniVocal.
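The abstract does not say how the refined cent tokens are constructed. For readers unfamiliar with the unit, a cent is 1/1200 of an octave, so a pitch f0 lies 1200 * log2(f0 / f_ref) cents from a reference pitch f_ref. The sketch below shows one generic way to turn a fundamental-frequency value into a discrete pitch token. It is an illustration only, not UniVocal's tokenizer: the reference pitch, the bin width, the token format, and the function names are all assumptions.

```python
import math

REF_HZ = 440.0            # assumed reference pitch (A4); not taken from the paper
BIN_CENTS = 50            # assumed bin width; 100 cents equal one semitone
UNVOICED = "<cent_unv>"   # hypothetical token for frames with no pitch


def hz_to_cents(f0_hz: float, ref_hz: float = REF_HZ) -> float:
    """Return the distance from ref_hz in cents: 1200 * log2(f0 / ref)."""
    return 1200.0 * math.log2(f0_hz / ref_hz)


def cent_token(f0_hz: float, bin_cents: int = BIN_CENTS) -> str:
    """Map one F0 value to a discrete token name (hypothetical format)."""
    if f0_hz <= 0:
        return UNVOICED
    # Snap the cent value to the nearest bin centre.
    bin_index = round(hz_to_cents(f0_hz) / bin_cents)
    return f"<cent_{bin_index * bin_cents:+d}>"


if __name__ == "__main__":
    # One octave above and below the reference, plus an unvoiced frame.
    for f0 in (220.0, 440.0, 880.0, 0.0):
        print(f0, cent_token(f0))
```

A coarser bin width gives a smaller vocabulary at the cost of pitch resolution; whichever trade-off the authors chose is documented in the paper and the released code, not here.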