UC Riverside (UCR) | Feb 18, 2026 | arXiv:2602.16169

Discrete Stochastic Localization for Non-autoregressive Generation

Yunshu Wu, Jiayi Cheng, Partha Thakuria, Rob Brekelmans, Evangelos E. Papalexakis, Greg Ver Steeg

AI Summary

The paper introduces Discrete Stochastic Localization (DSL), a training technique for masked diffusion language models (MDLMs) that improves the step-efficiency of non-autoregressive (NAR) generation. DSL trains a single SNR-invariant denoiser across a continuum of corruption levels, bridging intermediate draft noise and mask-style endpoint corruption within one Diffusion Transformer. On OpenWebText, DSL fine-tuning yields large MAUVE gains at low step budgets and surpasses the MDLM+ReMDM baseline with roughly 4x fewer denoiser evaluations, while also improving self-correction and uncertainty calibration.

Key Contribution

Training alone improves step-efficiency: with DSL fine-tuning, a masked diffusion model surpasses the MDLM+ReMDM baseline using roughly 4x fewer denoiser evaluations, because the denoiser learns to self-correct its own drafts more effectively.

Abstract

Non-autoregressive (NAR) generation reduces decoding latency by predicting many tokens in parallel, but iterative refinement often suffers from error accumulation and distribution shift under self-generated drafts. Masked diffusion language models (MDLMs) and their remasking samplers (e.g., ReMDM) can be viewed as modern NAR iterative refinement, where generation repeatedly revises a partially observed draft. In this work we show that training alone can substantially improve the step-efficiency of MDLM/ReMDM sampling. We propose DSL (Discrete Stochastic Localization), which trains a single SNR-invariant denoiser across a continuum of corruption levels, bridging intermediate draft noise and mask-style endpoint corruption within one Diffusion Transformer. On OpenWebText, DSL fine-tuning yields large MAUVE gains at low step budgets, surpassing the MDLM+ReMDM baseline with approximately 4x fewer denoiser evaluations, and matches autoregressive quality at high budgets. Analyses show improved self-correction and uncertainty calibration, making remasking markedly more compute-efficient.
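To make the phrase "repeatedly revises a partially observed draft" concrete, the following is a minimal, generic sketch of a mask-and-revise decoding loop. It is not the DSL training procedure and not the ReMDM sampler; the abstract does not specify either. The names `toy_denoiser` and `refine`, the random stand-in for the network, and the linear remasking schedule are all assumptions made purely for illustration. The point it shows is that each refinement step costs one denoiser evaluation, so the step budget is the quantity that step-efficiency refers to.

```python
import random

MASK = "<mask>"


def toy_denoiser(draft, reference, rng):
    """Hypothetical stand-in for a learned denoiser (assumption, not the paper's model).

    Proposes a (token, confidence) pair for every position. The more of the
    draft is visible, the more likely each proposal matches the reference.
    """
    visible = sum(tok != MASK for tok in draft) / len(draft)
    p_correct = 0.5 + 0.5 * visible
    proposals = []
    for ref_tok in reference:
        if rng.random() < p_correct:
            proposals.append((ref_tok, rng.uniform(0.6, 1.0)))
        else:
            proposals.append(("?", rng.uniform(0.0, 0.6)))
    return proposals


def refine(reference, num_steps, seed=0):
    """Generic mask-and-revise loop: predict every position in parallel, then remask.

    Returns the final draft and the number of denoiser evaluations used.
    """
    rng = random.Random(seed)
    n = len(reference)
    draft = [MASK] * n
    evaluations = 0
    for step in range(1, num_steps + 1):
        proposals = toy_denoiser(draft, reference, rng)
        evaluations += 1
        # Every position is re-predicted, so earlier choices can be revised.
        draft = [tok for tok, _ in proposals]
        # Assumed linear schedule: remask the least confident positions,
        # shrinking the remasked count to zero on the final step.
        num_remask = n * (num_steps - step) // num_steps
        least_confident = sorted(range(n), key=lambda i: proposals[i][1])
        for i in least_confident[:num_remask]:
            draft[i] = MASK
    return draft, evaluations


if __name__ == "__main__":
    final_draft, evals = refine(list("hello world"), num_steps=4)
    print("".join(final_draft), evals)
```

In this toy setting, a denoiser that is more accurate and better calibrated on its own partially filled drafts would need fewer passes through the loop to reach a given quality, which is the sense in which the abstract describes remasking becoming more compute-efficient.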

Architecture Design (Transformers, SSMs, MoE) | Natural Language Processing | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 36

Year: 2026

Venue: N/A
