Independent Researcher

Apr 27, 2026

arXiv:2604.24971

PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference

Ishan Patel, Ishan Joshi

AI Summary

PolyKV introduces a shared, asymmetrically compressed KV cache pool for multi-agent LLM inference: the cache is compressed once and injected into each agent's context via HuggingFace DynamicCache objects, instead of allocating one cache per agent. Keys are quantized to int8 to preserve softmax stability, while values are compressed with TurboQuant MSE, which applies a Fast Walsh-Hadamard Transform (FWHT) rotation followed by 3-bit Lloyd-Max quantization. Experiments on SmolLM2-1.7B-Instruct and Llama-3-8B-Instruct show a stable 2.91x compression ratio across all configurations. With 15 agents sharing a 4K-token context on Llama-3-8B, total KV cache memory falls by 97.7% relative to separate per-agent caches, with +0.57% perplexity degradation and a mean BERTScore F1 of 0.928.
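The asymmetric scheme summarized above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation, and several details are assumptions: the function names are hypothetical, keys use 32-element blocks with one fp16 scale each (the usual q8_0 convention), values are normalized by a per-vector RMS scale so the rotated coordinates roughly match N(0,1), the head dimension is a power of two, and the FWHT is applied without random sign flips. The centroids are the standard 8-level Lloyd-Max values for a unit Gaussian.

```python
import numpy as np

# 8-level (3-bit) Lloyd-Max centroids for a standard normal source.
LLOYD_MAX_3BIT = np.array(
    [-2.1520, -1.3439, -0.7560, -0.2451, 0.2451, 0.7560, 1.3439, 2.1520],
    dtype=np.float32,
)
# Nearest-centroid decision boundaries are the midpoints between centroids.
BOUNDARIES = (LLOYD_MAX_3BIT[:-1] + LLOYD_MAX_3BIT[1:]) / 2


def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform over the last axis.

    The last axis must have power-of-two length. With the 1/sqrt(n)
    scaling the transform is its own inverse.
    """
    x = np.array(x, dtype=np.float32)  # work on a copy
    lead, n = x.shape[:-1], x.shape[-1]
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        y = x.reshape(*lead, n // (2 * h), 2, h)
        a, b = y[..., 0, :], y[..., 1, :]
        x = np.stack([a + b, a - b], axis=-2).reshape(*lead, n)
        h *= 2
    return x / np.sqrt(n)


def compress_keys(k, block=32):
    """int8 keys: one fp16 scale per block of `block` elements (assumed layout)."""
    blocks = np.asarray(k, dtype=np.float32).reshape(*k.shape[:-1], -1, block)
    scale = np.abs(blocks).max(axis=-1, keepdims=True) / 127.0
    # Round the scale to fp16 first so quantize and dequantize agree exactly.
    scale = scale.astype(np.float16).astype(np.float32)
    scale = np.where(scale == 0, 1.0, scale)
    q = np.clip(np.round(blocks / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float16)


def decompress_keys(q, scale):
    out = q.astype(np.float32) * scale.astype(np.float32)
    return out.reshape(*q.shape[:-2], -1)


def compress_values(v):
    """3-bit values: FWHT rotation, per-vector RMS scale (assumed), Lloyd-Max codes."""
    rot = fwht(v)
    rms = np.sqrt(np.mean(rot ** 2, axis=-1, keepdims=True))
    scale = np.maximum(rms, 1e-6).astype(np.float16).astype(np.float32)
    # searchsorted maps each normalized coordinate to a centroid index in 0..7.
    codes = np.searchsorted(BOUNDARIES, rot / scale).astype(np.uint8)
    return codes, scale.astype(np.float16)


def decompress_values(codes, scale):
    rot = LLOYD_MAX_3BIT[codes] * scale.astype(np.float32)
    return fwht(rot)  # the orthonormal FWHT undoes itself
```

The codes are held one per byte here for readability; a real cache would pack them at 3 bits per element. Because the rotation is orthonormal, the reconstruction error on the values equals the quantization error in the rotated domain.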

Key Contribution

PolyKV cuts KV cache memory for multi-agent inference by up to 97.7% by replacing per-agent caches with a single shared, compressed pool, at a reported perplexity cost of +0.57%.

Abstract

We present PolyKV, a system in which multiple concurrent inference agents share a single, asymmetrically compressed KV cache pool. Rather than allocating a separate KV cache per agent -- the standard paradigm -- PolyKV writes a compressed cache once and injects it into N independent agent contexts via HuggingFace DynamicCache objects. Compression is asymmetric: Keys are quantized at int8 (q8_0) to preserve softmax stability, while Values are compressed using TurboQuant MSE -- a Fast Walsh-Hadamard Transform (FWHT) rotation followed by 3-bit Lloyd-Max quantization with centroids tuned to N(0,1). We evaluate across two model scales (SmolLM2-1.7B-Instruct and Llama-3-8B-Instruct), three context lengths (600-7,194 tokens), and up to 15 concurrent agents. PolyKV achieves a stable 2.91x compression ratio across all configurations. On Llama-3-8B with 15 agents sharing a 4K-token context, PolyKV reduces KV cache memory from 19.8 GB to 0.45 GB -- a 97.7% reduction -- while maintaining only +0.57% perplexity degradation and a mean BERTScore F1 of 0.928. PPL delta does not grow with agent count and improves as context length increases, inverting to -0.26% at 1,851 coherent tokens. To our knowledge, no prior work combines a single shared, lossy-compressed KV pool with multi-reader concurrent agent access.
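The reported figures can be cross-checked with simple arithmetic. The sketch below is one reading that reproduces them, not a derivation taken from the paper. It assumes a 16-bit baseline for both keys and values, ignores per-block scale overhead, and treats 19.8 GB as the total for 15 separate per-agent caches. The helper names are hypothetical.

```python
BASE_BITS = 16   # assumed fp16 baseline per element
KEY_BITS = 8     # int8 keys; scale overhead ignored (assumption)
VALUE_BITS = 3   # 3-bit Lloyd-Max values


def compression_ratio(key_bits=KEY_BITS, value_bits=VALUE_BITS, base_bits=BASE_BITS):
    """Bits for one key/value element pair before compression, divided by bits after."""
    return (2 * base_bits) / (key_bits + value_bits)


def shared_pool_gb(baseline_total_gb, n_agents, ratio):
    """Size of one shared compressed cache, given the total of n separate caches."""
    per_agent_gb = baseline_total_gb / n_agents
    return per_agent_gb / ratio


ratio = compression_ratio()                        # 32 / 11, about 2.91
baseline_gb = 19.8                                 # reported total for 15 agents
pool_gb = shared_pool_gb(baseline_gb, 15, ratio)   # about 0.45 GB
reduction = 1 - pool_gb / baseline_gb              # about 0.977

print(f"ratio={ratio:.2f}x pool={pool_gb:.2f} GB reduction={reduction:.1%}")
```

Under these assumptions the 97.7% figure factors into two effects: sharing one cache among 15 readers contributes a factor of 15, and compression contributes the remaining factor of about 2.91.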

Distributed Systems & Hardware

Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 25

Year: 2026

Venue: N/A
