Feb 19, 2026 · arXiv:2602.17097

AudioChat: Unified Audio Storytelling, Editing, and Understanding with Transfusion Forcing

William Chen, Prem Seetharaman, Rithesh Kumar, Oriol Nieto, Shinji Watanabe, Justin Salamon, Zeyu Jin

AI Summary

The paper introduces AudioChat, a framework for developing audio foundation models that generate, edit, and understand complex multi-source audio scenes, termed "audio stories." LLM-based tool-calling agents simulate interactions between users and the system, and the resulting dialogues serve as training data. A novel Audio Transfusion Forcing objective trains the model to decompose high-level instructions via structured chain-of-thought reasoning while performing interactive multi-turn audio understanding and generation.

Key Contribution

AudioChat tackles the complexity of "audio stories" by having LLM-driven tool-calling agents simulate user-system interactions and training on the resulting dialogues, so that audio foundation models can generate, edit, and understand complex multi-source acoustic scenes.

Abstract

Despite recent breakthroughs, audio foundation models struggle to process complex multi-source acoustic scenes. We refer to this challenging domain as audio stories, which can have multiple speakers and background/foreground sound effects. Compared to traditional audio processing tasks, audio stories introduce new layers of semantic, temporal, and physical complexity. To address this challenge, we propose AudioChat, a framework for developing audio foundation models that can generate, edit, and understand audio stories. AudioChat introduces a new paradigm in which LLM-based tool-calling agents simulate interactions between users and the system, and these simulated dialogues are used as training data. We also introduce a novel Audio Transfusion Forcing objective to train the AudioChat model, allowing it to simultaneously decompose high-level instructions via structured chain-of-thought reasoning and perform interactive multi-turn audio understanding/generation. To evaluate generation and editing performance, we develop three new metrics that directly measure task performance instead of relying on distribution-based scoring. We strongly encourage readers to visit our demo to better understand the capabilities of AudioChat: https://wanchichen.github.io/audiochat/.

Architecture Design (Transformers, SSMs, MoE) · Multimodal Models · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
