UCSD | Feb 10, 2026 | arXiv:2602.09891

Stemphonic: All-at-once Flexible Multi-stem Music Generation

Shih-Lun Wu, Ge Zhu, Juan-Pablo Caceres, C. Huang, Cheng-Zhi Anna Huang, Nicholas J. Bryan

AI Summary

The paper introduces Stemphonic, a diffusion/flow-based framework for generating multiple synchronized music stems in a single inference pass, addressing the limitations of existing methods that are either inflexible or slow. Stemphonic achieves this by treating each stem as a batch element during training, grouping synchronized stems, and applying a shared noise latent to each group, enabling efficient generation of variable stem combinations. Experiments on open-source stem evaluation sets demonstrate that Stemphonic achieves higher-quality outputs and accelerates full mix generation by 25-50%.

Key Contribution

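Stemphonic is a diffusion-/flow-based framework that generates a variable set of synchronized stems in a single inference pass. On open-source stem evaluation sets, it produces higher-quality outputs than existing stem generation methods while accelerating full mix generation by 25 to 50%.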

Abstract

Music stem generation, the task of producing musically-synchronized and isolated instrument audio clips, offers the potential of greater user control and better alignment with musician workflows compared to conventional text-to-music models. Existing stem generation approaches, however, either rely on fixed architectures that output a predefined set of stems in parallel, or generate only one stem at a time, resulting in slow inference despite flexibility in stem combination. We propose Stemphonic, a diffusion-/flow-based framework that overcomes this trade-off and generates a variable set of synchronized stems in one inference pass. During training, we treat each stem as a batch element, group synchronized stems in a batch, and apply a shared noise latent to each group. At inference-time, we use a shared initial noise latent and stem-specific text inputs to generate synchronized multi-stem outputs in one pass. We further expand our approach to enable one-pass conditional multi-stem generation and stem-wise activity controls to empower users to iteratively generate and orchestrate the temporal layering of a mix. We benchmark our results on multiple open-source stem evaluation sets and show that Stemphonic produces higher-quality outputs while accelerating the full mix generation process by 25 to 50%. Demos at: https://stemphonic-demo.vercel.app.
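The shared-noise grouping described in the abstract can be illustrated with a minimal sketch. It assumes each stem in a batch carries an integer group id marking which synchronized set it belongs to; the function name `shared_group_noise`, the latent shape, and the use of NumPy are hypothetical choices for illustration, not details taken from the paper. The sketch covers only the noise sampling step, not the model, the noise schedule, or the text conditioning.

```python
import numpy as np

def shared_group_noise(group_ids, latent_shape, rng):
    """Sample one Gaussian noise latent per group and give every stem
    in that group the same latent (hypothetical helper, not the paper's code)."""
    group_ids = np.asarray(group_ids)
    # `inverse` maps each batch element to the index of its group.
    unique_groups, inverse = np.unique(group_ids, return_inverse=True)
    inverse = inverse.reshape(-1)
    # One independent noise latent per group.
    per_group = rng.standard_normal((len(unique_groups), *latent_shape))
    # Broadcast each group's latent to all of its stems.
    return per_group[inverse]

rng = np.random.default_rng(0)
# A batch of five stems: three from one song (group 0), two from another (group 1).
group_ids = [0, 0, 0, 1, 1]
noise = shared_group_noise(group_ids, latent_shape=(4, 8), rng=rng)
```

In this sketch, stems that share a group id start from identical noise, while different groups receive independent noise. At inference time the same idea would reduce to a single group: one initial noise latent repeated for every requested stem, with each batch element given its own text input.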

Architecture Design (Transformers, SSMs, MoE) | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 43

Year: 2026

Venue: N/A
