NUS | Corresponding Author | Apr 9, 2026 | arXiv:2604.08302

DMax: Aggressive Parallel Decoding for dLLMs

Zigeng Chen, Gongfan Fang, Xinyin Ma, Ruonan Yu, Xinchao Wang

AI Summary

DMax, a new decoding paradigm for diffusion language models (dLLMs), addresses error accumulation in parallel decoding by reformulating decoding as progressive self-refinement from mask embeddings to token embeddings. The method introduces On-Policy Uniform Training to unify masked and uniform dLLMs, allowing the model to recover clean tokens from both masked inputs and its own erroneous predictions. Compared with the original LLaDA-2.0-mini, DMax raises tokens per forward pass (TPF) on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while keeping accuracy comparable, and it reaches an average of 1,338 TPS on two H200 GPUs at batch size 1.

Key Contribution

DMax speeds up diffusion language model decoding by reframing it as iterative self-correction in embedding space, raising TPF by roughly 2.7x on GSM8K (2.04 to 5.47) and 2.2x on MBPP (2.71 to 5.86) while keeping accuracy comparable.

Abstract

We present DMax, a new paradigm for efficient diffusion language models (dLLMs). It mitigates error accumulation in parallel decoding, enabling aggressive decoding parallelism while preserving generation quality. Unlike conventional masked dLLMs that decode through a binary mask-to-token transition, DMax reformulates decoding as a progressive self-refinement from mask embeddings to token embeddings. At the core of our approach is On-Policy Uniform Training, a novel training strategy that efficiently unifies masked and uniform dLLMs, equipping the model to recover clean tokens from both masked inputs and its own erroneous predictions. Building on this foundation, we further propose Soft Parallel Decoding. We represent each intermediate decoding state as an interpolation between the predicted token embedding and the mask embedding, enabling iterative self-revising in embedding space. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of DMax. Compared with the original LLaDA-2.0-mini, our method improves TPF on GSM8K from 2.04 to 5.47 while preserving accuracy. On MBPP, it increases TPF from 2.71 to 5.86 while maintaining comparable performance. On two H200 GPUs, our model achieves an average of 1,338 TPS at batch size 1. Code is available at: https://github.com/czg1225/DMax
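The interpolation at the heart of Soft Parallel Decoding can be illustrated with a minimal sketch. The abstract only states that an intermediate state is an interpolation between the predicted token embedding and the mask embedding; the function name `soft_state`, the scalar `weight`, and the toy three-dimensional embeddings below are hypothetical choices made for illustration, and how DMax actually sets the interpolation weight is not specified here.

```python
from typing import List


def soft_state(token_emb: List[float], mask_emb: List[float], weight: float) -> List[float]:
    """Linearly interpolate between a predicted token embedding and the mask embedding.

    weight = 0.0 returns the pure mask embedding (position still fully masked);
    weight = 1.0 returns the pure token embedding (position fully committed).
    The meaning of `weight` (e.g. a confidence score) is an assumption of this sketch.
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if len(token_emb) != len(mask_emb):
        raise ValueError("embeddings must have the same dimension")
    return [weight * t + (1.0 - weight) * m for t, m in zip(token_emb, mask_emb)]


# Toy embeddings (hypothetical values, not taken from any real model).
mask_emb = [0.0, 0.0, 1.0]
token_emb = [1.0, 2.0, 0.0]

# A half-committed position sits midway between the two embeddings.
half = soft_state(token_emb, mask_emb, 0.5)
print(half)
```

In this reading, a conventional masked dLLM corresponds to using only the two endpoints (weight 0 or 1), while intermediate weights give the model a softer input that it can keep revising in later steps.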

Architecture Design (Transformers, SSMs, MoE) | Distributed Systems & Hardware | Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 106

Year: 2026

Venue: N/A
