Apple ML | Berkeley University | Institut National de la Recherche | Feb 25, 2026 | arXiv:2602.21472

The Design Space of Tri-Modal Masked Diffusion Models

Louis Béthune, Victor Turrisi, Bruno Kacper Mlodozeniec, Pau Rodríguez López, Lokesh Boominathan, Nikhil Bhendawade, Amitis Shidani, Joris Pelemans, Theo X. Olausson, Devon Hjelm, Paul Dixon, João Monteiro, Pierre Ablin, Vishnu Banna, Arno Blaas, Nick Henderson, Kari Noriy, Dan Busbridge, Joshua M. Susskind, Marco Cuturi, Irina Belousova, Luca Zappella, Russ Webb, Jason Ramapuram

AI Summary

The paper introduces a tri-modal masked diffusion model pretrained from scratch on text, image-text, and audio-text data, in contrast to prior work that fine-tunes a unimodal base model for bimodal generation. The authors systematically analyze multimodal scaling laws, modality mixing ratios, noise schedules, and batch-size effects, and provide optimized inference sampling defaults. They also introduce an SDE-based reparameterization that decouples the physical batch size from the logical batch size, removing the need to tune for an optimal batch size.

Key Contribution

The paper shows that a tri-modal masked diffusion model can be pretrained from scratch: a preliminary 3B-parameter model trained on 6.4T tokens achieves strong results in text generation, text-to-image, and text-to-speech, supported by a systematic exploration of the design space and a novel SDE-based batch-size reparameterization.

Abstract

Discrete diffusion models have emerged as strong alternatives to autoregressive language models, with recent work initializing and fine-tuning a base unimodal model for bimodal generation. Diverging from previous approaches, we introduce the first tri-modal masked diffusion model pretrained from scratch on text, image-text, and audio-text data. We systematically analyze multimodal scaling laws, modality mixing ratios, noise schedules, and batch-size effects, and we provide optimized inference sampling defaults. Our batch-size analysis yields a novel stochastic differential equation (SDE)-based reparameterization that eliminates the need for tuning the optimal batch size as reported in recent work. This reparameterization decouples the physical batch size, often chosen based on compute constraints (GPU saturation, FLOP efficiency, wall-clock time), from the logical batch size, chosen to balance gradient variance during stochastic optimization. Finally, we pretrain a preliminary 3B-parameter tri-modal model on 6.4T tokens, demonstrating the capabilities of a unified design and achieving strong results in text generation, text-to-image tasks, and text-to-speech tasks. Our work represents the largest-scale systematic open study of multimodal discrete diffusion models conducted to date, providing insights into scaling behaviors across multiple modalities.
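For readers unfamiliar with masked (absorbing-state) discrete diffusion, the forward corruption step can be sketched as follows. This is a generic illustration of the technique, not the paper's implementation: the linear schedule, the `MASK_ID` value, and the function names `keep_probability`, `mask_tokens`, and `masked_positions` are all assumptions made for this example. The abstract does not specify which noise schedules the authors study.

```python
import random

MASK_ID = -1  # hypothetical id reserved for the [MASK] token (assumption)


def keep_probability(t: float) -> float:
    """Assumed linear schedule alpha(t) = 1 - t for t in [0, 1].

    alpha(t) is the probability that a token is still unmasked at time t.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return 1.0 - t


def mask_tokens(tokens, t, rng):
    """Replace each token independently by MASK_ID with probability 1 - alpha(t)."""
    alpha = keep_probability(t)
    return [tok if rng.random() < alpha else MASK_ID for tok in tokens]


def masked_positions(noisy):
    """Indices a denoiser would be trained to predict (the masked ones)."""
    return [i for i, tok in enumerate(noisy) if tok == MASK_ID]


if __name__ == "__main__":
    rng = random.Random(0)
    tokens = [11, 42, 7, 7, 93, 5, 18, 64]  # toy token ids
    for t in (0.0, 0.5, 1.0):
        noisy = mask_tokens(tokens, t, rng)
        print(t, noisy, masked_positions(noisy))
```

At t = 0 every token is kept and at t = 1 every token is masked; intermediate times mask a random subset. A denoising model is then trained to recover the original ids at the masked positions.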

Topics: Multimodal Models; Scaling Laws & Emergent Abilities; Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 102

Year: 2026

Venue: N/A
