Mar 19, 2026 | arXiv:2603.19176

Few-shot Acoustic Synthesis with Multimodal Flow Matching

AI Summary

This paper introduces Flow-matching Acoustic Generation (FLAC), a probabilistic method for few-shot acoustic synthesis that models the distribution of plausible room impulse responses (RIRs) given sparse scene context. FLAC uses a diffusion transformer trained with a flow-matching objective, conditioned on spatial, geometric, and acoustic cues, to generate RIRs at arbitrary positions in novel scenes. Experiments on the AcousticRooms and Hearing Anything Anywhere datasets show that FLAC in the one-shot setting outperforms state-of-the-art eight-shot baselines.
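To make the flow-matching objective mentioned above concrete, here is a minimal NumPy sketch. It assumes the common linear (straight-line) probability path between Gaussian noise and the data; the summary does not say which path FLAC uses. The names `velocity_fn`, `cond`, and `sample` are hypothetical placeholders, not the authors' code, and `velocity_fn` stands in for the conditioned diffusion transformer.

```python
import numpy as np

def flow_matching_pair(x0, x1, t):
    """Linear-path interpolant x_t and its target velocity (assumed path)."""
    t = t.reshape(-1, *([1] * (x1.ndim - 1)))  # broadcast t over feature dims
    x_t = (1.0 - t) * x0 + t * x1              # point on the noise-to-data line
    v_target = x1 - x0                         # constant velocity along that line
    return x_t, v_target

def flow_matching_loss(velocity_fn, x1, cond, rng):
    """Mean squared error between predicted and target velocities."""
    x0 = rng.standard_normal(x1.shape)         # noise sample
    t = rng.uniform(size=x1.shape[0])          # one time in [0, 1) per example
    x_t, v_target = flow_matching_pair(x0, x1, t)
    v_pred = velocity_fn(x_t, t, cond)         # hypothetical conditioned model
    return float(np.mean((v_pred - v_target) ** 2))

def sample(velocity_fn, cond, shape, rng, steps=50):
    """Generate a sample by Euler-integrating dx/dt = v(x, t, cond) from t=0 to 1."""
    x = rng.standard_normal(shape)
    dt = 1.0 / steps
    for i in range(steps):
        t = np.full(shape[0], i * dt)
        x = x + dt * velocity_fn(x, t, cond)
    return x
```

Because sampling starts from fresh noise each time, repeated calls to `sample` with the same `cond` yield different outputs, which is how a model of this kind can represent a distribution of plausible RIRs rather than a single deterministic estimate.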

Key Contribution

FLAC synthesizes room impulse responses from one-shot scene context by using flow matching to model a distribution of plausible RIRs rather than a single deterministic estimate, which lets it capture the uncertainty that sparse context leaves in scene acoustics.

Abstract

Generating audio that is acoustically consistent with a scene is essential for immersive virtual environments. Recent neural acoustic field methods enable spatially continuous sound rendering but remain scene-specific, requiring dense audio measurements and costly training for each environment. Few-shot approaches improve scalability across rooms but still rely on multiple recordings and, being deterministic, fail to capture the inherent uncertainty of scene acoustics under sparse context. We introduce flow-matching acoustic generation (FLAC), a probabilistic method for few-shot acoustic synthesis that models the distribution of plausible room impulse responses (RIRs) given minimal scene context. FLAC leverages a diffusion transformer trained with a flow-matching objective to generate RIRs at arbitrary positions in novel scenes, conditioned on spatial, geometric, and acoustic cues. In the one-shot setting, FLAC outperforms state-of-the-art eight-shot baselines on both the AcousticRooms and Hearing Anything Anywhere datasets. To complement standard perceptual metrics, we further introduce AGREE, a joint acoustic-geometry embedding that enables geometry-consistent evaluation of generated RIRs through retrieval and distributional metrics. This work is the first to apply generative flow matching to explicit RIR synthesis, establishing a new direction for robust and data-efficient acoustic synthesis.
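The abstract does not specify how AGREE's retrieval metrics are computed. As one plausible illustration only, the sketch below computes a generic cross-modal recall@k in a shared embedding space, assuming row i of the acoustic embeddings and row i of the geometry embeddings come from the same scene. The function name `recall_at_k` and both input arrays are hypothetical and not taken from the paper.

```python
import numpy as np

def recall_at_k(audio_emb, geom_emb, k=1):
    """Fraction of RIR embeddings whose matching geometry is in the top-k neighbours.

    Assumes audio_emb[i] and geom_emb[i] describe the same scene (hypothetical setup).
    """
    # L2-normalise so the dot product is cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    g = geom_emb / np.linalg.norm(geom_emb, axis=1, keepdims=True)
    sim = a @ g.T                                   # (N, N) similarity matrix
    topk = np.argsort(-sim, axis=1)[:, :k]          # indices of the k most similar geometries
    hits = (topk == np.arange(len(a))[:, None]).any(axis=1)
    return float(hits.mean())
```

Under this setup, a generated RIR that is consistent with its scene geometry should retrieve the correct geometry embedding more often, so higher recall would indicate better geometry consistency.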

Multimodal Models | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 89

Year: 2026

Venue: N/A
