The paper introduces EntropyCache, a training-free KV caching method for diffusion language models (dLLMs) that leverages decoded token entropy as a proxy for cache staleness. It selectively recomputes KV cache entries based on the maximum entropy of newly decoded token distributions, focusing on the $k$ most recently decoded tokens. Experiments with LLaDA-8B-Instruct and Dream-7B-Instruct demonstrate significant speedups (15.2x-26.4x on standard benchmarks and 22.4x-24.1x on chain-of-thought) with minimal accuracy loss and negligible overhead.
Diffusion language models can achieve up to 26x inference speedups with almost no accuracy loss, thanks to a clever entropy-based KV caching strategy that avoids costly full forward passes.
Diffusion-based large language models (dLLMs) rely on bidirectional attention, which prevents lossless KV caching and requires a full forward pass at every denoising step. Existing approximate KV caching methods reduce this cost by selectively updating cached states, but their decision overhead scales with context length or model depth. We propose EntropyCache, a training-free KV caching method that uses the maximum entropy of newly decoded token distributions as a constant-cost signal for deciding when to recompute. Our design is grounded in two empirical observations: (1) decoded token entropy correlates with KV cache drift, providing a cheap proxy for cache staleness, and (2) feature volatility of decoded tokens persists for multiple steps after unmasking, motivating recomputation of the $k$ most recently decoded tokens. The skip-or-recompute decision requires only $O(V)$ computation per step, independent of context length and model scale. Experiments on LLaDA-8B-Instruct and Dream-7B-Instruct show that EntropyCache achieves $15.2\times$-$26.4\times$ speedup on standard benchmarks and $22.4\times$-$24.1\times$ on chain-of-thought benchmarks, with competitive accuracy and decision overhead accounting for only $0.5\%$ of inference time. Code is available at https://github.com/mscheong01/EntropyCache.
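The decision rule described above can be sketched as follows. This is a minimal illustration, not the paper's released implementation: the function names, the threshold `tau`, and the `decoded_order` bookkeeping are all assumptions; only the use of maximum entropy over newly decoded token distributions and the refresh of the $k$ most recent tokens come from the abstract.

```python
import numpy as np

# Hypothetical sketch of an entropy-based skip-or-recompute rule
# (names assumed, not from the paper's code).
# `probs` holds the output distributions of the tokens decoded at the
# current denoising step, with shape (num_new_tokens, V).

def max_entropy(probs: np.ndarray) -> float:
    """Maximum Shannon entropy over the newly decoded token distributions.

    Cost is O(V) per token, independent of context length and model depth.
    """
    eps = 1e-12  # avoid log(0)
    ent = -(probs * np.log(probs + eps)).sum(axis=-1)
    return float(ent.max())

def should_recompute(probs: np.ndarray, tau: float) -> bool:
    """Recompute cached KV states when any new token's entropy exceeds tau.

    High entropy serves as a proxy for KV cache drift (staleness); when all
    new tokens are low-entropy, the step reuses the cache and skips the
    full forward pass.
    """
    return max_entropy(probs) > tau

def tokens_to_refresh(decoded_order: list, k: int) -> list:
    """Positions of the k most recently decoded tokens, whose features
    remain volatile for several steps after unmasking and are therefore
    the ones selected for recomputation."""
    return decoded_order[-k:]
```

In this sketch, a confident (near one-hot) distribution yields low entropy and lets the step reuse the cache, while a flat distribution triggers recomputation; the per-step cost of the check is a single O(V) reduction over each new token's distribution.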