This paper introduces SparseX, a segment-level KV Cache sharing method for long-context LLM serving that targets repeated content which prefix-only caching cannot reuse. SparseX treats contiguous token segments as reuse units and uses Sparse-Q indices to estimate which key tokens require correction, then restores cross-segment contextual interactions through Sparse-KV Recomputation within a single forward pass. A full+sparse hybrid attention mode keeps full attention in early layers, where it yields a more stable token-importance signal, and switches to sparse recomputation in later layers to improve reuse quality on complex long-context tasks.
SparseX shares KV Cache at the segment level for LLM serving, restoring cross-segment contextual interactions without additional models or separate preprocessing stages.
In long-context LLM serving, the prefill stage often dominates time-to-first-token and computational cost. Although Prefix Cache in vLLM/PagedAttention has been widely used to reuse identical prompt prefixes, repeated content in practical applications frequently appears as non-prefix, cross-request, cross-turn, and cross-agent segments, which makes conventional cache mechanisms insufficient. This paper presents SparseX, a segment-level KV Cache sharing method for common serving scenarios. SparseX uses contiguous token segments as reuse units and exploits Sparse-Q indices that naturally arise in KV Cache reuse workloads to estimate the key tokens that require correction. Based on this estimate, SparseX performs Sparse-KV Recomputation within a single forward pass, thereby restoring cross-segment contextual interactions under complex interleaved reuse patterns while avoiding additional models or separate preprocessing stages for token selection. SparseX further implements a full+sparse hybrid attention mode based on a layer-specific threshold: early layers retain full attention to obtain a more stable token-importance signal, and later layers switch to sparse recomputation to improve reuse quality on complex long-context tasks. We implement SparseX-vLLM on top of vLLM, integrating segment-level cache lookup, PagedAttention management, RoPE alignment, Sparse-Q token selection, and FlashAttention backends into a unified execution path. SparseX is model-agnostic, training-free, and compatible with Prefix Cache, and it provides unified support for common online serving scenarios including multi-round chat, retrieval-augmented generation (RAG), and agent workflows.
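To make the contrast between prefix reuse and segment-level reuse concrete, the following is a minimal toy sketch, not the SparseX-vLLM implementation. It assumes fixed-size segments keyed by a content hash, treats positions with no reusable segment as the Sparse-Q indices (this reading of the abstract is an assumption), and models the layer-specific threshold as a simple index comparison. The names SEGMENT_LEN, SegmentCache, prefix_hit_len, and attention_mode are hypothetical and do not come from the paper or from vLLM.

```python
from hashlib import sha256

SEGMENT_LEN = 4  # hypothetical segment size in tokens (assumption)


def segment_key(tokens):
    """Position-independent content hash of one contiguous token segment."""
    return sha256(",".join(map(str, tokens)).encode()).hexdigest()


def prefix_hit_len(cached, new):
    """Number of leading tokens shared with a cached prompt (prefix-style reuse)."""
    n = 0
    for a, b in zip(cached, new):
        if a != b:
            break
        n += 1
    return n


class SegmentCache:
    """Toy stand-in for a segment-level KV store; it holds positions, not tensors."""

    def __init__(self):
        self._store = {}  # content hash -> start position in the prompt that cached it

    def insert(self, tokens):
        # Register every full segment of a served prompt, aligned from position 0.
        for start in range(0, len(tokens) - SEGMENT_LEN + 1, SEGMENT_LEN):
            self._store[segment_key(tokens[start:start + SEGMENT_LEN])] = start

    def lookup(self, tokens):
        """Return (hits, misses): hits are (new_start, cached_start) pairs,
        misses are positions with no reusable segment."""
        hits, misses = [], []
        i = 0
        while i < len(tokens):
            seg = tokens[i:i + SEGMENT_LEN]
            key = segment_key(seg)
            if len(seg) == SEGMENT_LEN and key in self._store:
                hits.append((i, self._store[key]))
                i += SEGMENT_LEN  # skip past the reused segment
            else:
                misses.append(i)
                i += 1  # slide by one token so unaligned segments can still match
        return hits, misses


def attention_mode(layer_idx, threshold):
    """Hypothetical layer-threshold rule: full attention early, sparse later."""
    return "full" if layer_idx < threshold else "sparse"


doc = list(range(10, 18))          # a shared 8-token document
first = doc + [1, 2]               # request 1: document, then a question
second = [99, 98, 97] + doc + [5]  # request 2: new instruction, same document

cache = SegmentCache()
cache.insert(first)

prefix_len = prefix_hit_len(first, second)
hits, sparse_q = cache.lookup(second)
modes = [attention_mode(layer, threshold=2) for layer in range(4)]

print(prefix_len)  # 0: the prompts share no prefix
print(hits)        # [(3, 0), (7, 4)]: both document segments are reusable
print(sparse_q)    # [0, 1, 2, 11]: positions that still need fresh computation
print(modes)       # ['full', 'full', 'sparse', 'sparse']
```

In this toy run the second request shares no prefix with the first, yet both document segments are found at shifted positions. The difference between each new_start and cached_start is the kind of positional offset that the RoPE alignment step mentioned above would have to account for. Selecting which cached key tokens to recompute is left out here because the abstract does not specify the scoring rule.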