The paper introduces SongGen, a single-stage auto-regressive transformer for text-to-song generation that addresses the limitations of multi-stage approaches. SongGen offers fine-grained control via lyrics and textual descriptions of instrumentation, genre, mood, and timbre, and supports voice cloning from a three-second reference clip. The model is trained with different token pattern strategies for its mixed and dual-track output modes, and the authors demonstrate improved generation quality with their approach.
Ditch the clunky pipelines: SongGen generates complete songs from text in a single pass, offering unprecedented control over musical elements and voice cloning.
Text-to-song generation, the task of creating vocals and accompaniment from textual inputs, poses significant challenges due to domain complexity and data scarcity. Existing approaches often employ multi-stage generation procedures, leading to cumbersome training and inference pipelines, as well as suboptimal overall generation quality due to error accumulation across stages. In this paper, we propose SongGen, a fully open-source, single-stage auto-regressive transformer designed for controllable song generation. The proposed model facilitates fine-grained control over diverse musical attributes, including lyrics and textual descriptions of instrumentation, genre, mood, and timbre, while also offering an optional three-second reference clip for voice cloning. Within a unified auto-regressive framework, SongGen supports two output modes: mixed mode, which generates a mixture of vocals and accompaniment directly, and dual-track mode, which synthesizes them separately for greater flexibility in downstream applications. We explore diverse token pattern strategies for each mode, leading to notable improvements and valuable insights. Furthermore, we design an automated data preprocessing pipeline with effective quality control. To foster community engagement and future research, we will release our model weights, training code, annotated data, and preprocessing pipeline. The code is available at https://github.com/LiuZH-19/SongGen.
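The two output modes can be pictured as two ways of arranging tokens in one auto-regressive sequence: mixed mode emits a single combined stream, while dual-track mode keeps vocal and accompaniment tokens separate but must still serialize them for the transformer. The sketch below illustrates one plausible dual-track arrangement (frame-wise interleaving); the token values and the specific pattern are illustrative assumptions, not SongGen's actual token patterns.

```python
def mixed_mode(mixed_tokens):
    """Mixed mode: the model emits one combined vocal+accompaniment stream,
    so the sequence is used as-is."""
    return list(mixed_tokens)

def dual_track_interleave(vocal_tokens, acc_tokens):
    """Dual-track mode (assumed pattern): vocal and accompaniment tokens are
    generated as separate tracks, interleaved frame-by-frame into a single
    sequence the auto-regressive model can predict."""
    assert len(vocal_tokens) == len(acc_tokens), "tracks must be frame-aligned"
    seq = []
    for v, a in zip(vocal_tokens, acc_tokens):
        seq.extend([v, a])
    return seq

def dual_track_split(seq):
    """Invert the interleaving to recover the two tracks for separate
    decoding, which gives the flexibility the abstract mentions."""
    return seq[0::2], seq[1::2]
```

Because the tracks stay separable, a downstream application could, for example, remix or replace the accompaniment while keeping the generated vocal intact.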