Mar 5, 2026 | arXiv:2603.05454

Beyond Scattered Acceptance: Fast and Coherent Inference for DLMs via Longest Stable Prefixes

Pengxiang Li, Joey Tsai, Hongwei Xue, Kunyu Shi, Shilin Yan

AI Summary

The paper introduces the Longest Stable Prefix (LSP) scheduler, a novel inference paradigm for Diffusion Language Models (DLMs) that replaces scattered acceptance with monolithic prefix absorption. LSP identifies and commits contiguous, left-aligned blocks of stable tokens based on a single forward pass, snapping boundaries to linguistic delimiters. This approach yields up to 3.4x speedups on LLaDA-8B and Dream-7B across diverse benchmarks by converting fragmented KV cache updates into efficient contiguous appends and reducing token flip rates.

Key Contribution

Diffusion language models can achieve up to 3.4x faster inference without sacrificing quality by committing stable token prefixes instead of scattering accepted tokens.

Abstract

Diffusion Language Models (DLMs) promise highly parallel text generation, yet their practical inference speed is often bottlenecked by suboptimal decoding schedulers. Standard approaches rely on "scattered acceptance": committing high-confidence tokens at disjoint positions throughout the sequence. This approach inadvertently fractures the Key-Value (KV) cache, destroys memory locality, and forces the model into costly, repeated repairs across unstable token boundaries. To resolve this, we present the Longest Stable Prefix (LSP) scheduler, a training-free and model-agnostic inference paradigm based on monolithic prefix absorption. In each denoising step, LSP evaluates token stability via a single forward pass, dynamically identifies a contiguous left-aligned block of stable predictions, and snaps its boundary to natural linguistic or structural delimiters before an atomic commitment. This prefix-first topology yields dual benefits: systemically, it converts fragmented KV cache updates into efficient, contiguous appends; algorithmically, it preserves bidirectional lookahead over a geometrically shrinking active suffix, drastically reducing token flip rates and denoiser calls. Extensive evaluations on LLaDA-8B and Dream-7B demonstrate that LSP accelerates inference by up to 3.4x across rigorous benchmarks, including mathematical reasoning, code generation, multilingual (CJK) tasks, and creative writing, while matching or slightly improving output quality. By fundamentally restructuring the commitment topology, LSP bridges the gap between the theoretical parallelism of DLMs and practical hardware efficiency.
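The prefix-selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `longest_stable_prefix`, the confidence-threshold notion of stability, the delimiter set, and the fallback of committing a single token when nothing is stable are all assumptions made for illustration only.

```python
from typing import Sequence, Set

# Hypothetical delimiter set; the abstract does not list the actual delimiters.
DELIMITERS: Set[str] = {".", ",", ";", ":", "!", "?", "\n", ")", "}"}


def longest_stable_prefix(
    tokens: Sequence[str],
    confidences: Sequence[float],
    threshold: float = 0.9,
    delimiters: Set[str] = DELIMITERS,
) -> int:
    """Return how many leading tokens of the active suffix to commit.

    Assumption: a token counts as "stable" when its confidence from the
    current forward pass is at least `threshold`.
    """
    if len(tokens) != len(confidences):
        raise ValueError("tokens and confidences must have the same length")

    # Step 1: longest left-aligned run of stable predictions.
    stable = 0
    for conf in confidences:
        if conf < threshold:
            break
        stable += 1

    if stable == 0:
        # Assumption: commit one token anyway so decoding always progresses.
        return 1 if tokens else 0

    # Step 2: snap the boundary back to the last delimiter inside the run.
    for end in range(stable, 0, -1):
        if tokens[end - 1] in delimiters:
            return end

    # Assumption: with no delimiter in the run, commit the whole run.
    return stable


# Usage: the committed block is a contiguous prefix, so a KV cache could be
# extended with one append; the remainder stays active for the next step.
tokens = ["x", "=", "1", ";", "y", "=", "?", "2"]
confidences = [0.99, 0.95, 0.97, 0.93, 0.92, 0.50, 0.99, 0.99]
n = longest_stable_prefix(tokens, confidences)
committed, active = tokens[:n], tokens[n:]
```

In this toy input the first five tokens clear the threshold, and the boundary snaps back to the `;` so that the trailing `y` remains in the active suffix for further denoising. High-confidence tokens after the first unstable position are deliberately not committed, which is what distinguishes prefix absorption from scattered acceptance.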

Architecture Design (Transformers, SSMs, MoE); Inference & Quantization; Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 30

Year: 2026

Venue: N/A
