HIT | Mar 9, 2026 | arXiv:2603.08453

LycheeCluster: Efficient Long-Context Inference with Structure-Aware Chunking and Hierarchical KV Indexing

Dongfang Li, Zixuan Liu, Gang Lin, Baotian Hu, Min Zhang

AI Summary

LycheeCluster addresses the computational and memory challenges of long-context LLMs with two components: boundary-aware chunking, which preserves local semantic coherence, and a hierarchical index based on the triangle inequality for efficient KV cache management. Together they turn KV cache retrieval from a linear scan into a logarithmic-time pruning process, enabling faster inference. Experiments report up to a 3.6x end-to-end inference speedup with negligible degradation in model performance, outperforming existing KV cache management methods such as Quest and ClusterKV.

Key Contribution

Get up to 3.6x faster long-context LLM inference with LycheeCluster's hierarchical KV indexing, paired with boundary-aware chunking that avoids the semantic fragmentation of fixed-size chunking.
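
To make the contrast with fixed-size chunking concrete, here is a minimal Python sketch of one possible boundary-aware chunker. It is an illustration under stated assumptions, not the paper's implementation: the function name `boundary_aware_chunks`, the `max_len` parameter, and the choice of sentence punctuation as boundary tokens are all hypothetical.

```python
def boundary_aware_chunks(tokens, max_len=5,
                          boundaries=frozenset({".", "!", "?", "\n"})):
    """Split tokens into chunks of at most max_len tokens, preferring
    to cut right after a boundary token (assumed: sentence punctuation)."""
    chunks, start = [], 0
    while start < len(tokens):
        end = min(start + max_len, len(tokens))
        cut = end  # fall back to a fixed-size cut if no boundary is found
        if end < len(tokens):
            # Scan backward through the window for the last boundary token.
            for i in range(end - 1, start, -1):
                if tokens[i] in boundaries:
                    cut = i + 1  # keep the boundary token in this chunk
                    break
        chunks.append(tokens[start:cut])
        start = cut
    return chunks


tokens = "The cache is large . We prune it . Then decode".split()
chunks = boundary_aware_chunks(tokens, max_len=5)
for chunk in chunks:
    print(chunk)
```

With `max_len=5`, a fixed-size splitter would cut the second sentence after "." at token 10 and leave "Then" attached to it; the sketch instead ends each full chunk at a sentence boundary whenever one exists inside the window.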

Abstract

The quadratic complexity of the attention mechanism and the substantial memory footprint of the Key-Value (KV) cache present severe computational and memory challenges for Large Language Models (LLMs) processing long contexts. Existing retrieval-based methods often compromise semantic integrity through fixed-size chunking and suffer from inefficient linear scanning. In this paper, we propose LycheeCluster, a novel method for efficient KV cache management. LycheeCluster preserves local semantic coherence via boundary-aware chunking and constructs a recursive hierarchical index rooted in the triangle inequality. This design transforms cache retrieval from a linear scan into a theoretically bounded, logarithmic-time pruning process, while a lazy update strategy supports efficient streaming generation. Experiments demonstrate that LycheeCluster achieves up to a 3.6x end-to-end inference speedup with negligible degradation in model performance, outperforming state-of-the-art KV cache management methods (e.g., Quest, ClusterKV). We will release our code and kernels after publication.
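
The pruning idea behind a triangle-inequality index can be sketched in a few lines of Python. This is a hedged illustration, not the authors' code: it assumes Euclidean distance and a nearest-key search, and the names `Node`, `build_leaf`, `build_parent`, and `nearest` are hypothetical. Each node stores a centroid and a radius covering all keys beneath it; for a query q, every key k under the node satisfies d(q, k) >= d(q, centroid) - radius, so a subtree whose lower bound cannot beat the current best is skipped without scanning it.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    centroid: tuple
    radius: float                                  # covers every key below this node
    children: list = field(default_factory=list)   # child Nodes (internal node)
    keys: list = field(default_factory=list)       # (index, vector) pairs (leaf)


def build_leaf(items):
    dim = len(items[0][1])
    centroid = tuple(sum(v[d] for _, v in items) / len(items) for d in range(dim))
    radius = max(math.dist(centroid, v) for _, v in items)
    return Node(centroid, radius, keys=list(items))


def build_parent(children):
    dim = len(children[0].centroid)
    centroid = tuple(sum(c.centroid[d] for c in children) / len(children)
                     for d in range(dim))
    # The triangle inequality guarantees this radius covers each child ball.
    radius = max(math.dist(centroid, c.centroid) + c.radius for c in children)
    return Node(centroid, radius, children=list(children))


def nearest(node, query, best=(math.inf, None), scanned=None):
    """Return (distance, key index) of the closest key, pruning subtrees
    whose triangle-inequality lower bound cannot beat the current best."""
    lower_bound = max(0.0, math.dist(query, node.centroid) - node.radius)
    if lower_bound >= best[0]:
        return best  # no key under this node can be closer; skip it
    for idx, vec in node.keys:
        if scanned is not None:
            scanned.append(idx)
        d = math.dist(query, vec)
        if d < best[0]:
            best = (d, idx)
    # Visit closer children first so the bound tightens early.
    for child in sorted(node.children,
                        key=lambda c: math.dist(query, c.centroid)):
        best = nearest(child, query, best, scanned)
    return best


leaf_a = build_leaf([(0, (0.0, 0.0)), (1, (1.0, 0.0))])
leaf_b = build_leaf([(2, (10.0, 10.0)), (3, (11.0, 10.0))])
root = build_parent([leaf_a, leaf_b])

scanned = []
best_dist, best_idx = nearest(root, (0.9, 0.1), scanned=scanned)
print(best_idx, scanned)
```

In this toy example the far cluster is rejected by its bound alone, so only the two keys in the near cluster are ever compared against the query. An attention-oriented index would likely bound a similarity score rather than a Euclidean distance; that substitution is left out here because the abstract does not specify the metric.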

Architecture Design (Transformers, SSMs, MoE) | Distributed Systems & Hardware | Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
