Georgia Tech · Feb 18, 2026 · arXiv:2602.16687

Scaling Open Discrete Audio Foundation Models with Interleaved Semantic, Acoustic, and Text Tokens

Potsawee Manakul, Woody Haosheng Gan, Martijn Bartelds, William B. Held, Diyi Yang

AI Summary

The paper introduces SODA (Scaling Open Discrete Audio), a suite of native audio foundation models trained via next-token prediction on interleaved semantic, acoustic, and text tokens. Through a large-scale IsoFLOP analysis across 64 models, the authors derive scaling laws for discrete audio models, finding that optimal data size scales 1.6x faster than optimal model size. They demonstrate SODA's versatility by fine-tuning it for voice-preserving speech-to-speech translation, showcasing its potential as a unified architecture for diverse audio/text tasks.

Key Contribution

Forget text-first: SODA models show that scaling native audio foundation models with interleaved semantic, acoustic, and text tokens unlocks powerful audio generation and cross-modal capabilities.

Abstract

Current audio language models are predominantly text-first, either extending pre-trained text LLM backbones or relying on semantic-only audio tokens, limiting general audio modeling. This paper presents a systematic empirical study of native audio foundation models that apply next-token prediction to audio at scale, jointly modeling semantic content, acoustic details, and text to support both general audio generation and cross-modal capabilities. We provide comprehensive empirical insights for building such models: (1) We systematically investigate design choices -- data sources, text mixture ratios, and token composition -- establishing a validated training recipe. (2) We conduct the first scaling law study for discrete audio models via IsoFLOP analysis on 64 models spanning $3{\times}10^{18}$ to $3{\times}10^{20}$ FLOPs, finding that optimal data grows 1.6$\times$ faster than optimal model size. (3) We apply these lessons to train SODA (Scaling Open Discrete Audio), a suite of models from 135M to 4B parameters on 500B tokens, comparing against our scaling predictions and existing models. SODA serves as a flexible backbone for diverse audio/text tasks -- we demonstrate this by fine-tuning for voice-preserving speech-to-speech translation, using the same unified architecture.
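To make the IsoFLOP procedure in point (2) concrete, here is a minimal sketch of how such an analysis is commonly carried out; it is not the authors' code. It assumes the widely used approximation C ≈ 6·N·D (compute ≈ 6 × parameters × training tokens), fits a parabola to loss versus log-parameters at each fixed compute budget, and then fits power laws N_opt ∝ C^a and D_opt ∝ C^b. The loss function, the constant `5.0`, and the exponent `0.4` are synthetic placeholders chosen only so the script runs; they are not results from the paper. One possible reading of "optimal data grows 1.6× faster" is that the ratio b/a is about 1.6, but that interpretation is an assumption here.

```python
import numpy as np

def isoflop_optimum(param_counts, losses):
    """Fit a parabola to loss vs. log10(params); return params at its minimum."""
    x = np.log10(np.asarray(param_counts, dtype=float))
    a, b, _ = np.polyfit(x, np.asarray(losses, dtype=float), 2)
    if a <= 0:
        raise ValueError("fitted curve has no minimum")
    return 10.0 ** (-b / (2.0 * a))

def power_law_exponent(compute, optimum):
    """Slope of log10(optimum) vs. log10(compute), i.e. k in optimum ∝ C^k."""
    slope, _ = np.polyfit(np.log10(compute), np.log10(optimum), 1)
    return slope

def synthetic_loss(n, n_best):
    # Hypothetical bowl-shaped loss with its minimum at n == n_best.
    # Purely illustrative; real curves come from trained models.
    return 2.0 + 0.1 * (np.log10(n) - np.log10(n_best)) ** 2

# Compute budgets spanning the FLOP range quoted in the abstract.
budgets = np.array([3e18, 3e19, 3e20])

n_opt = []
for c in budgets:
    n_best = 5.0 * c ** 0.4  # made-up ground truth for the demo
    grid = n_best * np.array([0.25, 0.5, 1.0, 2.0, 4.0])  # model sizes at this budget
    n_opt.append(isoflop_optimum(grid, synthetic_loss(grid, n_best)))
n_opt = np.array(n_opt)

# Assumption: C ≈ 6 * N * D, so the optimal token count follows from N_opt.
d_opt = budgets / (6.0 * n_opt)

a_exp = power_law_exponent(budgets, n_opt)  # exponent for model size
b_exp = power_law_exponent(budgets, d_opt)  # exponent for data size
print(f"a = {a_exp:.3f}, b = {b_exp:.3f}, b/a = {b_exp / a_exp:.2f}")
```

Under the C ≈ 6·N·D assumption the two exponents sum to 1 by construction, so the fitted ratio b/a is the single quantity that says how much faster data should grow than model size as compute increases.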

Architecture Design (Transformers, SSMs, MoE) · Multimodal Models · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 41

Year: 2026

Venue: N/A
