KVSculpt optimizes a smaller set of unconstrained KV pairs in continuous embedding space to preserve attention behavior, rather than selecting or combining original cache entries. It alternates between optimizing keys with L-BFGS and solving for values in closed form via least squares, and an adaptive budget allocation step redistributes the compression budget across layers and KV heads. Experiments on Qwen2.5-1.5B-Instruct show a 3.5-4.1x reduction in KL divergence compared to Select+Fit, with adaptive allocation providing an additional 1.3x reduction.
Forget selecting or merging original KV pairs: KVSculpt distills the KV cache into a smaller, optimized representation in continuous embedding space, cutting KL divergence by up to 4.1x relative to Select+Fit.
KV cache compression is critical for efficient long-context LLM inference. Approaches that reduce the per-pair footprint -- quantization and low-rank decomposition -- are orthogonal to those that reduce the sequence length of the cache. Along the sequence-length dimension, existing methods range from pure eviction -- selecting which KV pairs to keep -- to merging, which combines similar pairs into fewer ones. Both remain anchored to the original cache entries. We propose KVSculpt, which moves to the other end of this spectrum: instead of selecting or combining original pairs, we optimize a smaller set of unconstrained KV pairs in continuous embedding space to preserve each layer's attention behavior. Keys are optimized via L-BFGS and values are solved in closed form via least squares, alternating every few steps. On top of this, we introduce adaptive budget allocation, which uses a cheap pilot compression run to redistribute the compression budget across layers and KV heads based on per-component difficulty. On Qwen2.5-1.5B-Instruct with 2048-token contexts, KVSculpt reduces KL divergence by 3.5-4.1x compared to Select+Fit -- attention-score eviction with least-squares value fitting -- across compression ratios r in {0.3, 0.5, 0.7}. Adaptive allocation provides an additional 1.3x KL reduction at no extra inference cost. Analysis reveals that compression difficulty is highly non-uniform: per-layer pilot MSE varies by up to 100x across layers, and the two KV heads within a single layer can differ by up to 467x -- demonstrating that fine-grained budget allocation is essential.
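To make the adaptive budget allocation idea concrete, here is a minimal sketch of one way a global budget could be redistributed once per-component pilot errors are known. The abstract does not give the allocation rule, so everything specific here is an assumption: the function name `allocate_budget`, the proportional-to-`mse ** alpha` weighting, the `alpha` and `min_keep` parameters, the reading of the ratio as the fraction of pairs kept, and the made-up pilot MSE values. It should not be read as the paper's actual procedure.

```python
def allocate_budget(pilot_mse, n_tokens, keep_ratio, alpha=0.5, min_keep=1):
    """Split a global KV-pair budget across components by pilot difficulty.

    pilot_mse  : one pilot-run error per component (e.g. per layer/KV-head pair)
    n_tokens   : original number of KV pairs in each component
    keep_ratio : fraction of all KV pairs to keep overall (assumed meaning)
    alpha      : hypothetical damping exponent on the pilot error
    min_keep   : hypothetical floor so no component is emptied
    """
    n = len(pilot_mse)
    total = round(keep_ratio * n_tokens * n)
    # Harder components (larger pilot error) get larger weights.
    weights = [max(m, 1e-12) ** alpha for m in pilot_mse]

    budgets = [min_keep] * n
    remaining = total - min_keep * n
    active = set(range(n))

    # Hand out the remaining budget proportionally, capping at n_tokens.
    while remaining > 0 and active:
        wsum = sum(weights[i] for i in active)
        shares = {i: remaining * weights[i] / wsum for i in active}

        capped = [i for i in active if budgets[i] + shares[i] >= n_tokens]
        if capped:
            # Saturated components keep every pair; redistribute the rest.
            for i in capped:
                remaining -= n_tokens - budgets[i]
                budgets[i] = n_tokens
                active.discard(i)
            continue

        # No cap hit: floor each share, then give the leftover units
        # to the components with the largest fractional remainders.
        floors = {i: int(shares[i]) for i in active}
        leftover = remaining - sum(floors.values())
        order = sorted(active, key=lambda i: shares[i] - floors[i], reverse=True)
        for i in active:
            budgets[i] += floors[i]
        for i in order[:leftover]:
            budgets[i] += 1
        remaining = 0

    return budgets


# Illustrative only: four components (say 2 layers x 2 KV heads) with
# invented pilot errors spanning a wide range.
pilot_mse = [0.01, 1.0, 0.02, 4.67]
budgets = allocate_budget(pilot_mse, n_tokens=2048, keep_ratio=0.5)
print(budgets, sum(budgets))
```

The total number of kept pairs equals what a uniform allocation at the same ratio would keep, so the compressed cache is the same size overall; only the split between easy and hard components changes.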