Mar 29, 2026 · arXiv:2603.27819

KVSculpt: KV Cache Compression as Distillation

AI Summary

KVSculpt optimizes a smaller, unconstrained set of KV pairs in continuous embedding space to preserve attention behavior, rather than selecting or combining original cache entries. Keys are optimized with L-BFGS and values are solved in closed form via least squares, with the two steps alternating every few iterations. An adaptive budget allocation then redistributes the compression budget across layers and KV heads. Experiments on Qwen2.5-1.5B-Instruct show a 3.5-4.1x reduction in KL divergence compared to Select+Fit, with adaptive allocation providing an additional 1.3x reduction.
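The closed-form value step mentioned above can be illustrated with a small NumPy sketch. It is not the paper's implementation: it assumes a single head, a set of reference queries `Q`, and an objective that matches the original attention output on those queries. The function names (`attention_weights`, `fit_values`, `output_mse`) are hypothetical, and the L-BFGS key update is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_weights(Q, K):
    # Scaled dot-product attention weights, shape (num_queries, num_keys).
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d), axis=-1)

def fit_values(Q, K_full, V_full, K_small):
    # Assumed target: the attention output of the uncompressed cache.
    target = attention_weights(Q, K_full) @ V_full
    # With the compressed keys held fixed, the output is linear in the values,
    # so the best values in the least-squares sense have a closed form.
    A_small = attention_weights(Q, K_small)
    V_small, *_ = np.linalg.lstsq(A_small, target, rcond=None)
    return V_small

def output_mse(Q, K_full, V_full, K_small, V_small):
    # Mean squared error between original and compressed attention outputs.
    target = attention_weights(Q, K_full) @ V_full
    approx = attention_weights(Q, K_small) @ V_small
    return float(np.mean((target - approx) ** 2))
```

In an alternating scheme, a key optimizer would update `K_small` for a few steps and `fit_values` would then be called again to refresh the values for the new keys.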

Key Contribution

Rather than selecting or merging original KV pairs, KVSculpt distills the KV cache into a smaller, optimized set of KV pairs in continuous embedding space, reducing KL divergence by up to 4.1x relative to Select+Fit.

Abstract

KV cache compression is critical for efficient long-context LLM inference. Approaches that reduce the per-pair footprint -- quantization and low-rank decomposition -- are orthogonal to those that reduce the sequence length of the cache. Along the sequence-length dimension, existing methods range from pure eviction -- selecting which KV pairs to keep -- to merging, which combines similar pairs into fewer ones. Both remain anchored to the original cache entries. We propose KVSculpt, which moves to the other end of this spectrum: instead of selecting or combining original pairs, we optimize a smaller set of unconstrained KV pairs in continuous embedding space to preserve each layer's attention behavior. Keys are optimized via L-BFGS and values are solved in closed form via least squares, alternating every few steps. On top of this, we introduce adaptive budget allocation, which uses a cheap pilot compression run to redistribute the compression budget across layers and KV heads based on per-component difficulty. On Qwen2.5-1.5B-Instruct with 2048-token contexts, KVSculpt reduces KL divergence by 3.5-4.1x compared to Select+Fit -- attention-score eviction with least-squares value fitting -- across compression ratios r in {0.3, 0.5, 0.7}. Adaptive allocation provides an additional 1.3x KL reduction at no extra inference cost. Analysis reveals that compression difficulty is highly non-uniform: per-layer pilot MSE varies by up to 100x across layers, and the two KV heads within a single layer can differ by up to 467x -- demonstrating that fine-grained budget allocation is essential.
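One way to picture the adaptive budget allocation is the sketch below. The abstract does not state the allocation rule, so everything here is an assumption: the function name `allocate_budgets`, the `keep_fraction` parameter (the fraction of KV pairs retained on average), and the rule itself, which gives each (layer, KV head) component a share proportional to its pilot MSE raised to a power `alpha`, capped at the full sequence length.

```python
def allocate_budgets(pilot_mse, seq_len, keep_fraction, alpha=0.5, min_keep=1):
    """Split a global KV budget across components by pilot difficulty.

    pilot_mse: one pilot-run error per (layer, KV head) component.
    Returns integer kept-pair counts that sum to the global budget.
    The proportional-to-mse**alpha rule is an illustrative assumption.
    """
    n = len(pilot_mse)
    total = round(keep_fraction * seq_len * n)
    if not (min_keep * n <= total <= seq_len * n):
        raise ValueError("global budget is infeasible for these settings")

    weights = [max(m, 1e-12) ** alpha for m in pilot_mse]
    budgets = [float(min_keep)] * n
    remaining = float(total - min_keep * n)
    active = set(range(n))

    # Water-filling: share out the remaining budget proportionally,
    # capping each component at seq_len and re-sharing the overflow.
    while remaining > 1e-9 and active:
        wsum = sum(weights[i] for i in active)
        shares = {i: remaining * weights[i] / wsum for i in active}
        remaining = 0.0
        for i, share in shares.items():
            room = seq_len - budgets[i]
            give = min(share, room)
            budgets[i] += give
            remaining += share - give
            if give >= room:
                active.discard(i)

    # Round down, then hand leftover units to the largest fractional parts.
    counts = [int(b) for b in budgets]
    leftover = total - sum(counts)
    order = sorted(range(n), key=lambda i: budgets[i] - counts[i], reverse=True)
    for i in order[:max(leftover, 0)]:
        counts[i] += 1
    return counts
```

Because the total is fixed, harder components receive more pairs only at the expense of easier ones, which is consistent with the abstract's point that the allocation adds no inference cost. An exponent `alpha` below 1 dampens the very wide spread in pilot MSE reported in the analysis.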

Architecture Design (Transformers, SSMs, MoE)

Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
