Feb 15, 2026 | arXiv:2602.14209

MAGE: All-[MASK] Block Already Knows Where to Look in Diffusion LLM

Omin Kwon, Yeonjae Kim, Minseo Kim, Yeonhong Park, Jae W. Lee

AI Summary

The paper introduces MAGE, a sparse attention mechanism for block diffusion LLMs that targets the memory-access bottleneck KV caching creates in long-context settings. MAGE uses the attention pattern from the first All-[MASK] denoising step to predict which KV entries matter, so it needs only a single exact attention pass per block and reuses the result for the remaining denoising steps. On long-context benchmarks, MAGE achieves near-lossless accuracy with a fraction of the KV budget and up to 3-4x end-to-end speedup, and it consistently outperforms sparse attention baselines designed for autoregressive models. Lightweight fine-tuning further strengthens the [MASK]-guided patterns.

Key Contribution

Block diffusion LLMs can retain near-lossless long-context accuracy with up to 3-4x end-to-end speedup by reusing the attention pattern observed at the first All-[MASK] denoising step, which enables training-free sparse attention without the approximate importance estimation that autoregressive-oriented methods rely on.

Abstract

Block diffusion LLMs are emerging as a promising next paradigm for language generation, but their use of KV caching makes memory access a dominant bottleneck in long-context settings. While dynamic sparse attention has been actively explored, existing methods designed for autoregressive LLMs rely on approximate importance estimation and perform poorly when adapted to block diffusion. This work identifies a key opportunity unique to block diffusion: attention at the first All-[MASK] denoising step reliably predicts important KV entries and budget requirements, enabling MAGE to perform a single exact attention pass per block and reuse it for training-free sparse denoising. Across long-context benchmarks including LongBench and Needle-in-a-Haystack, MAGE achieves near-lossless accuracy with a fraction of the KV budget while delivering up to 3-4x end-to-end speedup, consistently outperforming AR-oriented sparse attention baselines. A lightweight fine-tuning strategy further strengthens [MASK]-guided patterns with minimal cost, requiring only a few hours of training on a single NVIDIA H100 GPU for both 1.5B and 7B models.
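The idea in the abstract can be illustrated with a toy sketch: run one exact attention pass with the block's All-[MASK] queries, keep the KV positions that receive most of the attention mass, and restrict later denoising steps to those positions. This is not the authors' implementation. The function names, the mean-over-queries aggregation, and the cumulative-mass `coverage` rule for choosing the budget are all assumptions made for illustration, written in plain Python so it runs without dependencies.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(queries, keys):
    """Exact softmax attention weights: one row per query, one column per key."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    return [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        for q in queries
    ]

def select_kv_indices(mask_queries, keys, coverage=0.9):
    """Choose KV positions from the first (All-[MASK]) step.

    Assumption: importance is the mean attention weight over the block's
    queries, and the budget is the smallest set reaching `coverage` mass.
    """
    weights = attention_weights(mask_queries, keys)
    n_queries = len(weights)
    importance = [
        sum(row[j] for row in weights) / n_queries for j in range(len(keys))
    ]
    order = sorted(range(len(keys)), key=lambda j: importance[j], reverse=True)
    kept, mass = [], 0.0
    for j in order:
        kept.append(j)
        mass += importance[j]
        if mass >= coverage:
            break
    return sorted(kept)

def sparse_attention(queries, keys, values, kept):
    """Attention over only the selected KV entries (reused at later steps)."""
    sub_keys = [keys[j] for j in kept]
    sub_values = [values[j] for j in kept]
    weights = attention_weights(queries, sub_keys)
    dim = len(values[0])
    return [
        [sum(w * v[d] for w, v in zip(row, sub_values)) for d in range(dim)]
        for row in weights
    ]

# Toy KV cache with six entries; two of them dominate the attention.
keys = [[4.0, 0.0], [0.0, 0.1], [0.1, 0.0], [0.0, 4.0], [0.1, 0.1], [0.0, 0.0]]
values = [[1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
mask_queries = [[3.0, 0.0], [0.0, 3.0]]  # stand-ins for All-[MASK] block queries

kept = select_kv_indices(mask_queries, keys, coverage=0.9)
sparse_out = sparse_attention(mask_queries, keys, values, kept)
full_out = sparse_attention(mask_queries, keys, values, list(range(len(keys))))
```

In this toy setup the selection step yields both the kept positions and, through `len(kept)`, a per-block budget. Comparing `sparse_out` with `full_out` shows how little the output changes when the dropped entries carry negligible attention mass.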

Architecture Design (Transformers, SSMs, MoE) | Inference & Quantization | Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
