Microsoft Research, Guangzhou City Polytechnic, HKUST
May 25, 2026
arXiv:2605.25475

IndexMem: Learned KV-Cache Eviction with Latent Memory for Long-Context LLM Inference

Xintong Yang, Hao Gu, Binxing Xu, Lujun Li, Bei Liu, Jiacheng Liu, Qiyuan Zhu, Sirui Han, Yike Guo

AI Summary

The paper introduces IndexMem, a learned KV-cache eviction policy for long-context LLM inference that uses a learnable indexer to predict KV importance and a latent memory module to compress and retain information from evicted tokens. This approach mitigates the linear growth of the KV cache with sequence length, improving long-context performance under a bounded KV budget. Experiments across Qwen, Mistral, and Llama models demonstrate consistent improvements on RULER (up to 25 points under aggressive eviction), more stable Needle-in-a-Haystack retrieval, and superior LongBench scores compared to existing eviction policies.

Key Contribution

LLMs can maintain long-context performance even with aggressive KV-cache eviction by learning to predict token importance and compressing evicted tokens into a latent memory.

Abstract

Large Language Models (LLMs) are increasingly expected to operate over long contexts, yet standard softmax attention incurs a KV cache that grows linearly with sequence length, quickly becoming the bottleneck for long-context inference. A practical remedy is to evict less important KV entries; however, existing eviction policies are largely heuristic and struggle to capture the rich, input-dependent distribution of token importance. In this work, we introduce a learnable indexer that predicts KV importance, enabling more accurate retention of critical tokens. Meanwhile, naively evicting tokens permanently discards their information, leading to irreversible forgetting and degraded retrieval over long ranges. To address this, we propose a lightweight latent memory module that compresses evicted tokens into a compact, online-updated state and provides residual readouts to compensate for the attention contributions lost through KV eviction. Collectively, our method enables accurate long-context inference under a bounded KV budget, delivering consistent improvements on RULER (4K/16K) across Qwen, Mistral, and Llama models (up to 25 points under aggressive eviction), markedly more stable Needle-in-a-Haystack retrieval, and superior LongBench scores and compression curves compared to existing eviction policies.
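The abstract does not specify how the indexer scores tokens or how the latent memory is updated and read, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes per-token importance scores are already available (random numbers stand in for the learned indexer), keeps the top-`budget` KV entries, and folds evicted entries into a linear-attention-style summary `(S, z)` that is read out as a residual added to exact attention over the kept entries. The names `feature_map`, `evict`, `attend`, and `gate`, and the choice of an ELU+1 feature map, are all assumptions made for illustration.

```python
import numpy as np

def feature_map(x):
    # ELU(x) + 1: a positive feature map often used in linear attention (assumed here).
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def evict(keys, values, scores, budget, memory):
    """Keep the `budget` highest-scoring KV entries; compress the rest into `memory`.

    keys, values: (n, d) arrays. scores: (n,) importance values that a learned
    indexer would produce (hypothetical input). memory: tuple (S, z) with
    S of shape (d, d) and z of shape (d,).
    """
    n = keys.shape[0]
    if n <= budget:
        return keys, values, memory
    order = np.argsort(scores)                 # ascending importance
    evict_idx = np.sort(order[: n - budget])   # lowest-scoring entries
    keep_idx = np.sort(order[n - budget:])     # preserve original token order
    phi = feature_map(keys[evict_idx])         # (m, d)
    S, z = memory
    S = S + phi.T @ values[evict_idx]          # accumulate key-value outer products
    z = z + phi.sum(axis=0)                    # accumulate the normalizer
    return keys[keep_idx], values[keep_idx], (S, z)

def attend(q, keys, values, memory, gate=1.0, eps=1e-6):
    """Exact softmax attention over kept entries plus a residual memory readout."""
    logits = keys @ q / np.sqrt(q.shape[0])
    w = np.exp(logits - logits.max())
    w /= w.sum()
    exact = w @ values
    S, z = memory
    phi_q = feature_map(q)
    residual = (phi_q @ S) / (phi_q @ z + eps)  # zero when nothing was evicted
    return exact + gate * residual              # `gate` is a hypothetical mixing weight

rng = np.random.default_rng(0)
n, d, budget = 8, 4, 3
K, V = rng.normal(size=(n, d)), rng.normal(size=(n, d))
scores = rng.uniform(size=n)                    # stand-in for the learned indexer
memory = (np.zeros((d, d)), np.zeros(d))
K_kept, V_kept, memory = evict(K, V, scores, budget, memory)
out = attend(rng.normal(size=d), K_kept, V_kept, memory)
```

In this sketch the cache never holds more than `budget` entries after `evict` runs, while the memory state has a fixed size of `d * d + d` numbers regardless of how many tokens were evicted, which is the sense in which the KV budget stays bounded.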

Architecture Design (Transformers, SSMs, MoE)
Inference & Quantization

Year: 2026

Venue: N/A
