HKUST, PolyU | Jun 8, 2026 | arXiv:2606.09508

From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs

Zhanchao Xu, Qingfa Xiao, Fei Teng, Chen Jason Zhang, Lei Chen, Qing Li

AI Summary

This paper introduces EntropyInfer, a training-free framework that allocates compute for long-context LLM inference according to the entropy patterns observed in attention heads. It distinguishes Rigid Heads from Dynamic Heads, adapts compute allocation per head and per segment during prefilling, and uses a latent KV cache compression scheme that draws on generated output tokens during decoding. Experiments show that EntropyInfer achieves up to 2.39× end-to-end speedup beyond 100k tokens with minimal quality degradation relative to full attention.

Key Contribution

Allocating compute per head and per segment according to attention entropy yields up to 2.39× end-to-end speedup in long-context LLM inference beyond 100k tokens, with minimal quality degradation compared to full attention.
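The Rigid/Dynamic distinction can be illustrated with a minimal sketch. It assumes each head's attention distribution is available per input segment and labels a head "rigid" when its entropy stays below a small threshold in every segment. The function names, the threshold `eps`, and the toy distributions are hypothetical and are not taken from the paper or its released code.

```python
import math

def attention_entropy(probs):
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def classify_heads(per_head_segments, eps=0.1):
    """Label each head from its per-segment attention distributions.

    per_head_segments: dict mapping head id -> list of distributions,
    one distribution per input segment.
    eps is an assumed threshold for "near zero" entropy.
    """
    labels = {}
    for head, segments in per_head_segments.items():
        entropies = [attention_entropy(seg) for seg in segments]
        # Rigid: entropy stays near zero across all segments.
        # Everything else is treated as dynamic (a simplification).
        labels[head] = "rigid" if max(entropies) <= eps else "dynamic"
    return labels

# Toy input: head 0 is sharply peaked in both segments,
# head 1 is peaked in one segment and uniform in the other.
toy = {
    0: [[1.0, 0.0, 0.0, 0.0], [0.99, 0.01, 0.0, 0.0]],
    1: [[0.98, 0.01, 0.01, 0.0], [0.25, 0.25, 0.25, 0.25]],
}
labels = classify_heads(toy)
print(labels)
```

Because the abstract states that the mix of head types depends on the context, a classification like this would have to be recomputed on each input rather than cached offline.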

Abstract

Existing sparse attention and KV cache compression methods for long-context LLM inference typically apply fixed sparsity patterns or uniform budgets across all attention heads, overlooking the substantial variation in attention behavior among heads and contexts. We observe two distinct entropy patterns among attention heads: Rigid Heads, whose entropy stays near zero across input segments, and Dynamic Heads, whose entropy fluctuates significantly. Crucially, the distribution of these types is context-dependent and cannot be predetermined offline. We therefore propose EntropyInfer, a training-free framework that uses attention entropy to adaptively allocate compute at the granularity of individual heads and segments during prefilling. For decoding, we introduce a latent KV cache compression scheme that leverages generated output tokens, rather than prefill tokens alone, to identify and retain the most critical cache entries. Extensive experiments on the Llama, Qwen, and openPangu model series show that EntropyInfer consistently outperforms baselines including SnapKV, AdaKV, and CritiPrefill, achieving up to 2.39× end-to-end speedup beyond 100k tokens with minimal quality degradation compared to full attention. The code is available at https://github.com/SHA-4096/EntropyInfer.
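The idea of using generated output tokens to decide which cache entries to keep can be sketched generically. The example below assumes that the attention weights each generated token places on the cached positions are available, sums them per position, and keeps the highest-scoring positions within a budget. This scoring rule, the function name, and the toy weights are assumptions for illustration; the abstract does not specify how EntropyInfer scores entries, and the "latent" aspect of its scheme is not modeled here.

```python
def select_kv_entries(decode_attention, budget):
    """Pick which cached positions to retain.

    decode_attention: list of rows, one per generated token; each row
    holds that token's attention weights over the cached positions.
    budget: number of cache entries to keep.
    Returns the retained position indices in ascending order.
    """
    num_positions = len(decode_attention[0])
    scores = [0.0] * num_positions
    for row in decode_attention:
        for pos, weight in enumerate(row):
            scores[pos] += weight  # accumulate evidence from output tokens
    ranked = sorted(range(num_positions), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])

# Toy input: three generated tokens attending over four cached positions.
toy_attention = [
    [0.10, 0.60, 0.10, 0.20],
    [0.05, 0.50, 0.05, 0.40],
    [0.10, 0.30, 0.10, 0.50],
]
kept = select_kv_entries(toy_attention, budget=2)
print(kept)
```

In this toy case positions 1 and 3 receive most of the attention from the generated tokens, so they are the ones retained.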

Topics: Inference & Quantization; Scaling Laws & Emergent Abilities

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
