Feb 26, 2026 | arXiv:2602.23225

Why Diffusion Language Models Struggle with Truly Parallel (Non-Autoregressive) Decoding?

Dilxat Muhtar, Tianlong Chen, Shiwei Liu

AI Summary

The paper investigates why Diffusion Language Models (DLMs) tend to exhibit autoregressive-like decoding behavior despite their potential for parallel token generation. It posits that a key reason is the mismatch between DLM objectives and the sequential structure of training data, including standard pretraining corpora and chain-of-thought (CoT) supervision. To address this, the authors introduce NAP (Non-Autoregressive Parallel DLMs), a data-centric approach that curates training examples as multiple independent reasoning trajectories and uses a parallel-forced decoding strategy.

Key Contribution

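DLMs fall back to AR-like decoding largely because their training data is highly sequential. As a proof of concept, NAP shows that curating data as multiple independent reasoning trajectories supports genuinely parallel decoding and yields stronger math reasoning performance under parallel decoding than training on standard long CoT data.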

Abstract

Diffusion Language Models (DLMs) are often advertised as enabling parallel token generation, yet practical fast DLMs frequently converge to left-to-right, autoregressive (AR)-like decoding dynamics. In contrast, genuinely non-AR generation is promising because it removes AR's sequential bottleneck, better exploiting parallel hardware to reduce synchronization/communication overhead and improve latency scaling with output length. We argue that a primary driver of AR-like decoding is a mismatch between DLM objectives and the highly sequential structure of widely used training data, including standard pretraining corpora and long chain-of-thought (CoT) supervision. Motivated by this diagnosis, we propose NAP (Non-Autoregressive Parallel DLMs), a proof-of-concept, data-centric approach that better aligns supervision with non-AR parallel decoding. NAP curates examples as multiple independent reasoning trajectories and couples them with a parallel-forced decoding strategy that encourages multi-token parallel updates. Across math reasoning benchmarks, NAP yields stronger performance under parallel decoding than DLMs trained on standard long CoT data, with gains growing as parallelism increases. Our results suggest that revisiting data and supervision is a principled direction for mitigating AR-like behavior and moving toward genuinely non-autoregressive parallel generation in DLMs. Our code is available at https://github.com/pixeli99/NAP.
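
The latency argument in the abstract can be illustrated with a toy calculation. The sketch below is not the authors' implementation (the linked repository is the reference for that); it only counts sequential decoding steps. The function names `sequential_steps` and `parallel_trajectory_schedule` are hypothetical, as is the assumption that the output is laid out as equally sized, independent trajectory blocks with one position unmasked per block per step.

```python
import math


def sequential_steps(total_tokens: int, tokens_per_step: int = 1) -> int:
    """Count the forward passes needed when each pass commits
    `tokens_per_step` tokens. With tokens_per_step=1 this is the
    AR-like case: one pass per generated token."""
    if total_tokens < 0 or tokens_per_step < 1:
        raise ValueError("total_tokens must be >= 0 and tokens_per_step >= 1")
    return math.ceil(total_tokens / tokens_per_step)


def parallel_trajectory_schedule(num_trajectories: int, trajectory_len: int):
    """Toy unmasking schedule (an assumption, not the paper's algorithm).

    The output is treated as `num_trajectories` contiguous blocks of
    `trajectory_len` positions each. At every step, the next position
    of every block is unmasked at the same time, so the number of
    steps depends on the block length, not on the total length.
    """
    return [
        [block * trajectory_len + offset for block in range(num_trajectories)]
        for offset in range(trajectory_len)
    ]


if __name__ == "__main__":
    total = 4 * 128  # four hypothetical trajectories of 128 tokens each
    print("AR-like steps:", sequential_steps(total))
    print("4 tokens per step:", sequential_steps(total, tokens_per_step=4))
    print("toy schedule:", parallel_trajectory_schedule(4, 3))
```

In this toy setting, the step count with four tokens committed per pass is a quarter of the AR-like count. This is the sense in which parallel decoding improves latency scaling with output length. Whether a real DLM keeps its accuracy at that level of parallelism is the empirical question the paper studies.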

Architecture Design (Transformers, SSMs, MoE) | Distributed Systems & Hardware | Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 45

Year: 2026

Venue: N/A
