This paper introduces Prefilling-dLLM, a training-free framework that speeds up diffusion large language models (dLLMs) by partitioning the input prefix into chunks and caching their key-value representations once. By selecting the most relevant chunks for decoding and applying token sparsity within each chunk, the method reduces per-step computational complexity from quadratic in the full sequence length to quadratic only in the decoding length. On LongBench and InfiniteBench, Prefilling-dLLM achieves state-of-the-art quality among dLLM acceleration methods, with speedups of 9.1x to 28.0x at context lengths from 8K to 32K.
Sparse prefilling can dramatically accelerate long-context inference in diffusion language models, achieving up to 28x speedup without sacrificing quality.
Diffusion large language models (dLLMs) re-encode the entire prefix at every denoising step, causing recomputation that scales quadratically with context length and becomes prohibitive for long-context scenarios. We propose Prefilling-dLLM, a training-free prefill-decode disaggregation framework for dLLMs that partitions the prefix into N chunks, caches their KV representations once, and selects the top-K most relevant chunks with intra-chunk token sparsity for decoding, showing that sparse prefilling can outperform dense attention while reducing per-step complexity from quadratic in the full sequence length to quadratic only in the decode length. On LongBench and InfiniteBench, Prefilling-dLLM achieves state-of-the-art quality among dLLM acceleration methods, and an attention kernel that parallelizes decoding over the non-contiguously cached chunk KV yields a 9.1x to 28.0x speedup at 8K to 32K contexts. We further show that beginning-of-sequence tokens prepended to each chunk act as periodic attention anchors that eliminate the lost-in-the-middle phenomenon. Code is available at https://github.com/menik1126/Prefilling-dLLM.
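The chunk-and-select idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the helper names (`partition`, `chunk_score`, `select_top_k`) are hypothetical, and the relevance score used here (mean dot product between a decode query and a chunk's cached keys) is an assumption chosen for simplicity, since the abstract does not state the scoring rule. Intra-chunk token sparsity and the attention kernel are omitted.

```python
from typing import List, Sequence

Vector = Sequence[float]


def partition(prefix: Sequence, n_chunks: int) -> List[list]:
    """Split a prefix into n_chunks contiguous chunks of near-equal size."""
    if not 0 < n_chunks <= len(prefix):
        raise ValueError("n_chunks must be between 1 and len(prefix)")
    size, rem = divmod(len(prefix), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `rem` chunks absorb one extra element each.
        end = start + size + (1 if i < rem else 0)
        chunks.append(list(prefix[start:end]))
        start = end
    return chunks


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def chunk_score(query: Vector, chunk_keys: Sequence[Vector]) -> float:
    """Assumed relevance score: mean query-key dot product over the chunk."""
    return sum(dot(query, k) for k in chunk_keys) / len(chunk_keys)


def select_top_k(query: Vector, cached_keys: Sequence[Sequence[Vector]], k: int) -> List[int]:
    """Return indices of the k highest-scoring chunks, in prefix order."""
    scores = [chunk_score(query, keys) for keys in cached_keys]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


# Toy "cached keys" for a 6-token prefix, computed once and reused at every step.
prefix_keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [-1.0, 0.0], [-1.0, 0.0]]
cached = partition(prefix_keys, n_chunks=3)
selected = select_top_k([1.0, 0.5], cached, k=2)
```

In this toy setup the decode step would attend only to the cached keys of the chunks listed in `selected`, rather than re-encoding all six prefix tokens.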