RUC | May 27, 2026 | arXiv:2605.28063

Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts

Yuyue Wang, Xihua Wang, Yi-Jing Chen, Ruihua Song

AI Summary

This paper introduces Free-Form-Text-Prompt-to-Unified-Audio generation, a new task focused on synthesizing unified audio containing speech, sound, and their composites directly from unconstrained natural language. To tackle this task, the authors propose PlanAudio, an autoregressive LLM-based framework that uses a semantic latent chain-of-thought mechanism to bridge high-level semantic understanding and low-level acoustic synthesis. They also create PlanAudio-Bench, a benchmark for composite audio scenarios. Experiments show that PlanAudio generally outperforms existing pipeline and unified baselines across speech, sound, and composite audio scenarios, while staying competitive with models designed for a single scenario.

Key Contribution

PlanAudio replaces disjoint pipelines and structured inputs with a single autoregressive LLM that uses a semantic latent chain-of-thought to synthesize unified audio directly from free-form text prompts.

Abstract

Audio generation has made significant progress, yet synthesizing unified audio where speech and sounds are naturally composited remains a challenge. Current methods either rely on disjoint pipelines, which fail to capture fine-grained interactions, or require structured inputs and external text rewriting, which limits the flexibility of free-form text prompts. In this paper, we introduce a new task: Free-Form-Text-Prompt-to-Unified-Audio generation, which aims to directly synthesize unified audio containing speech, sound, and their composites from unconstrained natural language. To address this task, we propose PlanAudio, a unified, autoregressive LLM-based framework. First, it simplifies the model architecture by leveraging intrinsic LLM reasoning capability instead of traditional text encoders. Second, it introduces a semantic latent chain-of-thought mechanism, an implicit planning mechanism that bridges high-level semantic understanding and low-level acoustic synthesis. Furthermore, we create PlanAudio-Bench, a specialized benchmark for evaluating composite audio scenarios. We perform evaluations in the scenarios of speech, sound, and their composites. The results demonstrate that PlanAudio generally outperforms the existing pipeline and unified baselines, while staying competitive with models designed for a single scenario. Our analysis further reveals the superiority of semantic latent CoT over other CoT mechanisms and highlights the importance of continuous multi-scenario training curricula.

Multimodal Models, Natural Language Processing, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 35

Year: 2026

Venue: N/A
