Feb 16, 2026

arXiv:2602.15014

Scaling Beyond Masked Diffusion Language Models

Subham Sekhar Sahoo, Jean-Marie Lemercier, Zhihan Yang, Justin Deschenaux, Jingyu Liu, John Thickstun

AI Summary

This paper presents a scaling-law study of uniform-state and interpolating discrete diffusion language models, challenging the dominance of Masked diffusion. The authors show that Masked diffusion models can be made approximately 12% more FLOPs-efficient when trained with a simple cross-entropy objective. They also find that perplexity is informative within a diffusion family but can mislead across families, because a model with worse likelihood scaling may sample faster and sit better on the speed-quality Pareto frontier. With all methods scaled to 1.7B parameters, uniform-state diffusion remains competitive on likelihood-based benchmarks and outperforms autoregressive and Masked diffusion models on GSM8K, despite worse validation perplexity.

Key Contribution

Uniform-state diffusion models, often overlooked in favor of masked diffusion, surprisingly outperform autoregressive and masked diffusion models on GSM8K when scaled to 1.7B parameters, despite worse perplexity.

Abstract

Diffusion language models are a promising alternative to autoregressive models due to their potential for faster generation. Among discrete diffusion approaches, Masked diffusion currently dominates, largely driven by strong perplexity on language modeling benchmarks. In this work, we present the first scaling law study of uniform-state and interpolating discrete diffusion methods. We also show that Masked diffusion models can be made approximately 12% more FLOPs-efficient when trained with a simple cross-entropy objective. We find that perplexity is informative within a diffusion family but can be misleading across families, where models with worse likelihood scaling may be preferable due to faster and more practical sampling, as reflected by the speed-quality Pareto frontier. These results challenge the view that Masked diffusion is categorically the future of diffusion language modeling and that perplexity alone suffices for cross-algorithm comparison. Scaling all methods to 1.7B parameters, we show that uniform-state diffusion remains competitive on likelihood-based benchmarks and outperforms autoregressive and Masked diffusion models on GSM8K, despite worse validation perplexity. We provide the code, model checkpoints, and video tutorials on the project page: http://s-sahoo.github.io/scaling-dllms
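The abstract's argument about the speed-quality Pareto frontier can be made concrete with a small sketch. The code below is not taken from the paper or its released code; it is a minimal illustration, assuming each sampler configuration is summarized by two numbers where lower is better: a sampling cost (seconds per sample) and a quality loss (for example, a generative perplexity). The `Run` type, the `pareto_frontier` helper, the configuration names, and all numbers are hypothetical placeholders, not measured results.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    """One sampler configuration (hypothetical; not a result from the paper)."""
    name: str
    seconds_per_sample: float  # sampling cost, lower is better
    quality_loss: float        # e.g. a generative perplexity, lower is better


def pareto_frontier(runs: List[Run]) -> List[Run]:
    """Return the runs that no other run beats on both cost and quality loss."""
    frontier = []
    best_loss = float("inf")
    # Visit runs from fastest to slowest; a run stays on the frontier only if
    # it improves quality over every run that is at least as fast.
    for run in sorted(runs, key=lambda r: (r.seconds_per_sample, r.quality_loss)):
        if run.quality_loss < best_loss:
            frontier.append(run)
            best_loss = run.quality_loss
    return frontier


# Made-up numbers, chosen only to illustrate the computation.
runs = [
    Run("config_a", 1.0, 50.0),
    Run("config_b", 2.0, 40.0),
    Run("config_c", 2.5, 45.0),  # slower and worse than config_b: dominated
    Run("config_d", 4.0, 30.0),
    Run("config_e", 0.5, 80.0),
]

frontier_names = [run.name for run in pareto_frontier(runs)]
print(frontier_names)
```

Under this view, a model family with worse validation perplexity can still be preferable if its configurations occupy the frontier at the sampling budgets that matter in practice, which is the sense in which the abstract says perplexity alone can mislead across families.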

Natural Language Processing

Scaling Laws & Emergent Abilities

Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
