Search papers, labs, and topics across Lattice.
PolyKV introduces a shared, asymmetrically compressed KV cache pool for multi-agent LLM inference, enabling multiple agents to access a single compressed cache via HuggingFace DynamicCache objects. Keys are quantized to int8 for softmax stability, while values undergo TurboQuant MSE compression, involving FWHT rotation and 3-bit Lloyd-Max quantization. Experiments on SmolLM2-1.7B and Llama-3-8B show PolyKV achieves a stable 2.91x compression ratio, reducing memory usage by up to 97.7% with minimal perplexity degradation and high BERTScore.
Squeeze your LLM inference costs: PolyKV slashes KV cache memory by up to 97% using a shared, compressed pool, with negligible impact on quality.
We present PolyKV, a system in which multiple concurrent inference agents share a single, asymmetrically compressed KV cache pool. Rather than allocating a separate KV cache per agent -- the standard paradigm -- PolyKV writes a compressed cache once and injects it into N independent agent contexts via HuggingFace DynamicCache objects. Compression is asymmetric: Keys are quantized at int8 (q8_0) to preserve softmax stability, while Values are compressed using TurboQuant MSE -- a Fast Walsh-Hadamard Transform (FWHT) rotation followed by 3-bit Lloyd-Max quantization with centroids tuned to N(0,1). We evaluate across two model scales (SmolLM2-1.7B-Instruct and Llama-3-8B-Instruct), three context lengths (600-7,194 tokens), and up to 15 concurrent agents. PolyKV achieves a stable 2.91x compression ratio across all configurations. On Llama-3-8B with 15 agents sharing a 4K-token context, PolyKV reduces KV cache memory from 19.8 GB to 0.45 GB -- a 97.7% reduction -- while maintaining only +0.57% perplexity degradation and a mean BERTScore F1 of 0.928. PPL delta does not grow with agent count and improves as context length increases, inverting to -0.26% at 1,851 coherent tokens. To our knowledge, no prior work combines a single shared, lossy-compressed KV pool with multi-reader concurrent agent access.