This paper introduces RedKnot, a head-aware KV cache management system for long-context large language model (LLM) serving that decomposes the KV cache along KV heads. Because heads differ in functional role, attention distance, and runtime importance, RedKnot can apply cache reuse, compression, and placement policies per head and per serving scenario. The authors report improved resource efficiency with preserved output fidelity, without model retraining or fine-tuning.
Managing the KV cache as a head-aware structure rather than a monolithic tensor could make LLM serving more memory-efficient and scalable.
As the input length of large language model (LLM) serving continues to grow, the KV cache has become a dominant bottleneck in AI infrastructure. It limits GPU memory capacity, serving concurrency, cache reuse, and distributed scalability. Several important problems, including position-independent KV cache, prefix KV cache compression, hot/cold KV cache separation, and distributed KV cache management, all depend on how the KV cache is represented and managed. However, existing serving systems largely rely on a monolithic KV cache abstraction, where the KV cache is treated as a homogeneous sequence of token-level memory blocks and managed with similar policies across attention heads and serving scenarios. We observe that KV cache utility is highly structured across KV heads: different heads exhibit different functional roles, attention distances, and runtime importance. Therefore, a full KV cache is not always necessary for every head, token range, or serving scenario. We present RedKnot, a head-aware KV cache management system for LLM serving. RedKnot breaks the conventional monolithic KV cache abstraction by decomposing the KV cache along KV heads, whose importance and effective attention ranges vary significantly across serving scenarios. This head-level decomposition turns the KV cache from a monolithic tensor abstraction into a structured memory object, enabling RedKnot to uniformly support position-independent KV reuse, prefix KV compression, hot/cold KV separation, and distributed KV placement while preserving output fidelity and improving resource efficiency, without requiring model retraining or fine-tuning. RedKnot establishes a new foundation for AI infrastructure by transforming the KV cache from a monolithic, passive runtime artifact into a dynamic, model-aware runtime substrate for scalable LLM serving.
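To make the head-level decomposition concrete, the following is a minimal sketch of a KV cache that stores entries per KV head and gives each head its own retention policy. It is not RedKnot's implementation: the class names (`HeadCache`, `HeadAwareKVCache`), the `window` parameter, and the choice of sliding-window eviction are all assumptions made for illustration, standing in for whatever per-head policies the system actually uses.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Optional, Tuple

# One token's (key vector, value vector) for a single head.
KV = Tuple[List[float], List[float]]


@dataclass
class HeadCache:
    """KV entries for one KV head, with a hypothetical per-head retention window."""
    window: Optional[int] = None  # None = keep every token (assumed long-range head)
    entries: Deque[Tuple[int, KV]] = field(default_factory=deque)

    def append(self, position: int, kv: KV) -> None:
        self.entries.append((position, kv))
        # A head with a bounded window drops its oldest token once the window is full.
        if self.window is not None and len(self.entries) > self.window:
            self.entries.popleft()


class HeadAwareKVCache:
    """Illustrative cache decomposed along KV heads instead of one monolithic tensor."""

    def __init__(self, windows: List[Optional[int]]) -> None:
        self.heads = [HeadCache(window=w) for w in windows]

    def append(self, position: int, per_head_kv: List[KV]) -> None:
        if len(per_head_kv) != len(self.heads):
            raise ValueError("expected one (key, value) pair per KV head")
        for head, kv in zip(self.heads, per_head_kv):
            head.append(position, kv)

    def stored_tokens(self) -> List[int]:
        """Number of tokens currently retained by each head."""
        return [len(h.entries) for h in self.heads]

    def positions(self, head_index: int) -> List[int]:
        """Token positions still cached for the given head."""
        return [pos for pos, _ in self.heads[head_index].entries]


def demo() -> HeadAwareKVCache:
    # Head 0 keeps everything; heads 1 and 2 keep only recent tokens.
    cache = HeadAwareKVCache(windows=[None, 4, 2])
    for pos in range(10):
        kv = ([float(pos)], [float(pos)])
        cache.append(pos, [kv, kv, kv])
    return cache


if __name__ == "__main__":
    cache = demo()
    print(cache.stored_tokens())
    print(cache.positions(1))
```

In this toy setup a monolithic cache would hold 30 head-token entries after 10 tokens, while the per-head policy holds 16. Real savings depend on how heads are classified and on the fidelity constraints the abstract mentions, neither of which this sketch models.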