Ruhr University Bochum | Apr 27, 2026 | arXiv:2604.24647

DepthKV: Layer-Dependent KV Cache Pruning for Long-Context LLM Inference

AI Summary

DepthKV is a layer-dependent KV cache pruning strategy for long-context LLM inference. It targets the memory bottleneck caused by a KV cache that grows linearly with sequence length. The authors show that layers differ in their sensitivity to pruning and propose allocating a fixed global KV budget across layers according to that sensitivity, rather than applying a uniform pruning ratio. In experiments across multiple models and tasks, DepthKV consistently outperforms uniform pruning at the same global pruning ratio.
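To make the linear growth concrete, the following is a minimal sketch of the standard KV cache size calculation for a decoder-only transformer. The model dimensions used as defaults (32 layers, 8 KV heads, head dimension 128, 2 bytes per element) are illustrative assumptions and do not come from the paper.

```python
def kv_cache_bytes(seq_len, num_layers=32, num_kv_heads=8, head_dim=128,
                   bytes_per_elem=2, batch_size=1):
    """Estimate KV cache size in bytes for an unpruned cache.

    All default dimensions are hypothetical; substitute those of the
    model being served.
    """
    # Factor 2 accounts for storing both a key and a value per token.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return batch_size * seq_len * per_token


if __name__ == "__main__":
    for n in (1_000, 32_000, 128_000):
        print(n, kv_cache_bytes(n) / 1e9, "GB")
```

With these assumed dimensions, 1,000 tokens occupy 131,072,000 bytes, and doubling the sequence length doubles the footprint.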

Key Contribution

Not all layers are created equal: allocating the KV cache budget per layer according to pruning sensitivity consistently outperforms uniform pruning at the same global pruning ratio.

Abstract

Long-context reasoning is a critical capability of large language models (LLMs), enabling applications such as long-document understanding, summarization, and code generation. However, efficient autoregressive inference relies on the key-value (KV) cache, whose memory footprint grows linearly with sequence length, leading to a major memory bottleneck. To mitigate this overhead, KV cache pruning methods discard cached tokens with low attention scores during inference. Most existing methods apply a uniform pruning ratio across layers, implicitly assuming that all layers contribute equally to overall model performance. We show that this assumption is suboptimal, as layers differ significantly in their sensitivity to pruning. We propose DepthKV, a layer-dependent pruning framework that allocates a fixed global KV budget across layers based on their sensitivity, rather than using a uniform allocation. Across multiple models and tasks, DepthKV consistently outperforms uniform pruning at the same global pruning ratio, demonstrating more effective utilization of the KV cache budget through layer-dependent allocation.
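The idea of splitting a fixed global budget across layers can be sketched as follows. This is an illustration under stated assumptions, not the paper's method: the abstract does not say how sensitivity is measured or how the budget is divided, so the per-layer `sensitivity` scores, the proportional allocation rule, and the function names `allocate_layer_budgets` and `prune_layer` are all hypothetical.

```python
def allocate_layer_budgets(sensitivity, seq_len, keep_ratio):
    """Split a global KV budget across layers in proportion to sensitivity.

    Assumption: sensitivity holds one positive score per layer, and a
    proportional split is used purely for illustration.
    """
    num_layers = len(sensitivity)
    # Same total number of kept tokens as uniform pruning at keep_ratio.
    total = round(keep_ratio * seq_len * num_layers)
    weight_sum = sum(sensitivity)
    # Proportional share, capped at seq_len (a layer cannot keep more
    # tokens than exist).
    budgets = [min(seq_len, int(total * s / weight_sum)) for s in sensitivity]
    # Hand any remainder to the most sensitive layers that still have room.
    leftover = total - sum(budgets)
    for layer in sorted(range(num_layers), key=lambda i: -sensitivity[i]):
        extra = min(leftover, seq_len - budgets[layer])
        budgets[layer] += extra
        leftover -= extra
    return budgets


def prune_layer(keys, values, attn_scores, budget):
    """Keep the `budget` cached tokens with the highest attention scores."""
    ranked = sorted(range(len(attn_scores)),
                    key=lambda i: attn_scores[i], reverse=True)
    keep = sorted(ranked[:budget])  # restore original token order
    return [keys[i] for i in keep], [values[i] for i in keep]


if __name__ == "__main__":
    budgets = allocate_layer_budgets([1, 2, 3, 2], seq_len=1000, keep_ratio=0.5)
    print(budgets, sum(budgets))
```

In this toy setup, uniform pruning would keep 500 tokens in each of the four layers. The sketch keeps the same 2,000 tokens overall but shifts them toward the layers marked as more sensitive.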

Topics: Architecture Design (Transformers, SSMs, MoE); Distributed Systems & Hardware; Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 41

Year: 2026

Venue: N/A
