Mar 9, 2026 · arXiv:2603.07865

SoundWeaver: Semantic Warm-Starting for Text-to-Audio Diffusion Serving

Ayush Barik, Sofia Stoica, Nikhil Sarda, Arnav Kethana, Abhinav Khanduja, Mucheng Xu, Muchen Xu, Fan Lai

AI Summary

SoundWeaver is a training-free, model-agnostic serving system that accelerates text-to-audio diffusion models by warm-starting the diffusion process from semantically similar cached audio. It uses a Reference Selector to retrieve and temporally align cached candidates, a Skip Gater to decide dynamically what percentage of function evaluations (NFEs) to skip, and a Cache Manager to maintain cache utility through quality-aware eviction and refinement. On real-world audio traces, it reduces latency by 1.8-3.0x with a cache of about 1K entries while preserving or improving perceptual quality.

Key Contribution

SoundWeaver cuts text-to-audio diffusion latency by 1.8-3.0x without retraining by warm-starting generation from semantically similar cached audio.
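To put the reported latency range in terms of skipped work, the short sketch below converts a target speedup into the fraction of NFEs that would need to be skipped. It rests on a simplifying assumption that is not from the paper: latency is proportional to the number of function evaluations, and retrieval and alignment overhead is ignored. The function name is hypothetical.

```python
def skip_fraction_for_speedup(speedup: float) -> float:
    """Fraction of NFEs to skip to reach `speedup`.

    Assumption (not from the paper): latency scales linearly with NFEs
    and cache lookup overhead is negligible, so speedup = 1 / (1 - skip).
    """
    if speedup < 1.0:
        raise ValueError("speedup must be at least 1.0")
    return 1.0 - 1.0 / speedup


if __name__ == "__main__":
    for target in (1.8, 3.0):
        frac = skip_fraction_for_speedup(target)
        print(f"{target:.1f}x -> skip about {frac:.0%} of NFEs")
```

Under this assumption, a 1.8x speedup corresponds to skipping roughly 44% of NFEs and a 3.0x speedup to roughly 67%. Any real overhead would require skipping more to reach the same end-to-end speedup.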

Abstract

Text-to-audio diffusion models produce high-fidelity audio but require tens of function evaluations (NFEs), incurring multi-second latency and limited throughput. We present SoundWeaver, the first training-free, model-agnostic serving system that accelerates text-to-audio diffusion by warm-starting from semantically similar cached audio. SoundWeaver introduces three components: a Reference Selector that retrieves and temporally aligns cached candidates via semantic and duration-aware gating; a Skip Gater that dynamically determines the percentage of NFEs to skip; and a lightweight Cache Manager that maintains cache utility through quality-aware eviction and refinement. On real-world audio traces, SoundWeaver achieves 1.8--3.0$\times$ latency reduction with a cache of only $\sim$1K entries while preserving or improving perceptual quality.
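The retrieve-then-skip flow described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine-similarity retrieval, the duration-ratio gate, the linear mapping from similarity to skip fraction, and every name and threshold (`CacheEntry`, `select_reference`, `skip_fraction`, `min_sim`, `max_duration_ratio`, `max_skip`) are assumptions chosen for illustration only.

```python
import math
from dataclasses import dataclass


@dataclass
class CacheEntry:
    audio_id: str                  # hypothetical handle to the cached audio
    prompt_embedding: list[float]  # embedding of the prompt that produced it
    duration_s: float              # clip length in seconds (must be > 0)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, clamped to at most 1.0; 0.0 for zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return min(dot / (na * nb), 1.0) if na and nb else 0.0


def select_reference(query_emb, query_duration_s, cache,
                     min_sim=0.6, max_duration_ratio=1.5):
    """Return (entry, similarity) for the best cached candidate, or (None, 0.0).

    Assumed gating: drop entries whose duration differs too much from the
    request, then keep the most similar entry at or above `min_sim`.
    """
    best, best_sim = None, min_sim
    for entry in cache:
        longer = max(entry.duration_s, query_duration_s)
        shorter = min(entry.duration_s, query_duration_s)
        if longer / shorter > max_duration_ratio:
            continue  # duration-aware gate
        sim = cosine(query_emb, entry.prompt_embedding)
        if sim >= best_sim:
            best, best_sim = entry, sim
    return (best, best_sim) if best is not None else (None, 0.0)


def skip_fraction(sim, min_sim=0.6, max_skip=0.67):
    """Assumed linear gate: skip nothing at `min_sim`, `max_skip` at 1.0."""
    if sim < min_sim:
        return 0.0
    return max_skip * (sim - min_sim) / (1.0 - min_sim)


def steps_to_run(total_nfes: int, sim: float) -> int:
    """Number of denoising steps still executed after the warm start."""
    return total_nfes - int(total_nfes * skip_fraction(sim))
```

On a cache miss, `select_reference` returns `(None, 0.0)` and `steps_to_run` falls back to the full schedule, so the sketch never skips work without a sufficiently similar reference. How the cached audio is injected into the diffusion process, and how the Cache Manager evicts and refines entries, is not specified in the abstract and is left out here.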

Inference & Quantization · Natural Language Processing · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 34

Year: 2026

Venue: N/A
