University of California, San Diego (UCSD) · May 21, 2026 · arXiv:2605.22717

Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators

Zachary Novack, Stephen Brade, Haven Kim, Hugo Flores García, Nithya Shikarpur, Chinmay Talegaonkar, Suwan Kim, Valerie K. Chen, Julian McAuley, Taylor Berg-Kirkpatrick, Cheng-Zhi Anna Huang

AI Summary

This paper introduces Live Music Diffusion Models (LMDMs), a modification to standard audio diffusion models that enables efficient interactive music generation on consumer hardware. LMDMs achieve this by implementing block-wise KV caching during inference, recovering and surpassing the computational efficiency of discrete Live Music Models (LMMs). Furthermore, the authors introduce ARC-Forcing, a post-training alignment method that reduces error accumulation in LMDMs without relying on RL or reward models, facilitating creative applications like text-conditioned generation and live jamming.

Key Contribution

Diffusion models, typically too slow for interactive music generation, can now jam in real time on a consumer laptop thanks to block-wise KV caching at inference and a new post-training alignment method, ARC-Forcing.

Abstract

Interactive streaming music generation promises the use of generative models for live performance and co-creation that is impossible with offline models. However, SOTA models exist in the discrete-AR regime, requiring industrial levels of compute for both training and inference. In this work, we investigate whether audio diffusion models, with their wide support in the open-source community but non-streaming bidirectional nature, can be repurposed efficiently into interactive models accessible on consumer hardware. By taking a critical look at the modern pipeline for block-wise outpainting diffusion, we identify critical inefficiencies during inference that result in strictly worse computational efficiency than their discrete-AR counterparts. We propose Live Music Diffusion Models (LMDMs), a simple modification of the generative diffusion process that recovers, and then outperforms, the inference complexity of the discrete Live Music Models (LMMs) through block-wise KV Caching. Unlike LMMs, LMDMs further enable stable post-training alignment through our novel ARC-Forcing paradigm, reducing error accumulation without any explicit RL or reward models. We demonstrate the application of LMDMs in a number of creative domains, including text-conditioned generation, sketch-based music synthesis, and jamming. We finally show how LMDMs can be used as a generative instrument in a real artist-AI collaboration, utilizing LMDMs as a "generative delay" to transform musicians' improvisation live for variable timbral effects while running locally on a consumer gaming laptop.
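The abstract names block-wise KV caching without describing its mechanics. The following is a minimal single-layer sketch of the general idea, not the paper's implementation: keys and values of already finalized blocks are stored once and reused across every denoising step of the current block, while tokens inside the current block attend to each other bidirectionally. All names here (`BlockKVCache`, `cached_block_attention`, the toy noise schedule) are hypothetical, and the "denoiser" is a stand-in that simply shrinks the noise to zero.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v, mask=None):
    """Scaled dot-product attention; mask is True where attention is allowed."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores) @ v

class BlockKVCache:
    """Hypothetical cache holding keys/values of finalized blocks only."""
    def __init__(self, dim):
        self.k = np.zeros((0, dim))
        self.v = np.zeros((0, dim))

    def append(self, k_block, v_block):
        self.k = np.concatenate([self.k, k_block])
        self.v = np.concatenate([self.v, v_block])

def cached_block_attention(x_block, Wq, Wk, Wv, cache):
    """Attend from the current block to cached past blocks plus itself."""
    q, k, v = x_block @ Wq, x_block @ Wk, x_block @ Wv
    keys = np.concatenate([cache.k, k])
    values = np.concatenate([cache.v, v])
    return attend(q, keys, values), k, v

def block_causal_mask(n_blocks, block_len):
    """Reference mask: full attention within a block, causal across blocks."""
    ids = np.repeat(np.arange(n_blocks), block_len)
    return ids[:, None] >= ids[None, :]

rng = np.random.default_rng(0)
dim, block_len, n_blocks, n_steps = 8, 4, 3, 5
Wq, Wk, Wv = (rng.standard_normal((dim, dim)) for _ in range(3))
clean = rng.standard_normal((n_blocks * block_len, dim))

cache = BlockKVCache(dim)
outputs = []
for b in range(n_blocks):
    target = clean[b * block_len:(b + 1) * block_len]
    for step in range(n_steps):
        # Toy stand-in for denoising: noise scale reaches exactly 0 at the last step.
        noise_scale = 1.0 - (step + 1) / n_steps
        x_t = target + noise_scale * rng.standard_normal(target.shape)
        # Past blocks are never re-projected here; only the cache is read.
        out, k, v = cached_block_attention(x_t, Wq, Wk, Wv, cache)
    # Only the finalized (fully denoised) block enters the cache.
    cache.append(k, v)
    outputs.append(out)

cached_out = np.concatenate(outputs)
reference = attend(clean @ Wq, clean @ Wk, clean @ Wv,
                   block_causal_mask(n_blocks, block_len))
print(np.allclose(cached_out, reference))
```

In this toy setting the cached path matches a full recomputation with a block-causal mask, while each denoising step projects only the current block's tokens rather than the whole history. A multi-layer model would keep one such cache per attention layer.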

Architecture Design (Transformers, SSMs, MoE) · Speech & Audio · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
