Grinnell College, Lambda, Rice, Workato | Jun 22, 2026 | arXiv:2606.23961

Forget Without Compromise: Nexus Sampling for Streaming KV-Cache Eviction Under Fixed Budgets

Duc Duong, Hoang Anh Duy Le, Jianwen Xie, Anshumali Shrivastava, Zhaozhuo Xu

AI Summary

This paper introduces Nexus Sampling, a training-free method for KV-cache eviction in long-context LLM workloads under fixed memory budgets. It combines Nexus scoring, an iterative walk over direct attention that surfaces bridge tokens, with weighted reservoir sampling, so that subtly important tokens are retained with some inclusion probability instead of being cut permanently by deterministic top-$K$ selection. At 80% cache eviction, Nexus Sampling is reported to match dense attention within 1% on LongBench and to outperform top-$K$ baselines on retrieval-heavy tasks, with up to 10x smaller per-sequence cache memory.

Key Contribution

Nexus Sampling retains subtly important tokens during KV-cache eviction, staying within 1% of dense attention on LongBench at 80% eviction with up to 10x smaller per-sequence cache memory.

Abstract

Long-context and agentic LLM workloads push the KV cache past any fixed memory budget, forcing the inference stack to permanently evict tokens at every step of a continuous-inference stream. Existing methods all share the same template, a per-step direct-attention score followed by deterministic top-$K$ selection, which converts a single below-cutoff step into an irreversible verdict and permanently erases any subtly important token that direct attention cannot single out from noise. To address this challenge, we propose Nexus Sampling, a training-free eviction method that pairs Nexus scoring, an iterative walk over direct attention that surfaces bridge tokens, with weighted reservoir sampling, which retains tokens with inclusion probability in place of deterministic top-$K$. Theoretically, we show that Nexus Sampling dominates deterministic top-$K$ in long-run survival of subtly important tokens. Empirically, at 80% KV cache eviction, Nexus Sampling matches dense attention within 1% on LongBench while outperforming top-$K$ baselines on retrieval-heavy tasks, with up to 10x smaller per-sequence cache memory.
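To make the contrast between deterministic top-$K$ and weighted reservoir sampling concrete, here is a minimal sketch under stated assumptions. It is not the paper's implementation: the reservoir uses the standard Efraimidis-Spirakis scheme (key = u^(1/w), kept in log space), each token gets one fixed weight when it arrives, and the weights are placeholders for whatever Nexus scoring would produce, which the abstract does not specify. The names `top_k_keep`, `WeightedReservoir`, and `survival_rate` are hypothetical.

```python
import heapq
import math
import random


def top_k_keep(scores, budget):
    """Deterministic baseline: keep the `budget` highest-scoring token ids."""
    return set(sorted(scores, key=scores.get, reverse=True)[:budget])


class WeightedReservoir:
    """Fixed-budget weighted reservoir (Efraimidis-Spirakis style).

    Assumption: each token has a single positive weight set on arrival.
    In the paper the weight would come from Nexus scoring; here it is
    just a number supplied by the caller.
    """

    def __init__(self, budget, seed=0):
        self.budget = budget
        self.rng = random.Random(seed)
        self.heap = []  # min-heap of (key, token_id); smallest key is evicted first

    def offer(self, token_id, weight):
        """Offer one token; return the evicted token id, or None if none."""
        u = 1.0 - self.rng.random()      # uniform in (0, 1], avoids log(0)
        key = math.log(u) / weight       # log of u**(1/weight); higher is better
        if len(self.heap) < self.budget:
            heapq.heappush(self.heap, (key, token_id))
            return None
        if key > self.heap[0][0]:
            # New token displaces the current lowest-key resident.
            return heapq.heapreplace(self.heap, (key, token_id))[1]
        return token_id                  # new token itself is not retained

    def kept(self):
        return {token_id for _, token_id in self.heap}


def survival_rate(subtle_weight=0.5, n_other=20, budget=5, trials=500):
    """Fraction of trials in which a low-weight token (id 0) stays cached."""
    hits = 0
    for seed in range(trials):
        reservoir = WeightedReservoir(budget, seed=seed)
        reservoir.offer(0, subtle_weight)        # the "subtly important" token
        for token_id in range(1, n_other + 1):
            reservoir.offer(token_id, 1.0)       # higher-weight competitors
        hits += 0 in reservoir.kept()
    return hits / trials


if __name__ == "__main__":
    scores = {0: 0.5, **{i: 1.0 for i in range(1, 21)}}
    print("top-K keeps token 0:", 0 in top_k_keep(scores, 5))
    print("reservoir survival rate of token 0:", survival_rate())
```

In this toy setup the lowest-weight token is never kept by top-$K$, while under the reservoir it survives in a nonzero fraction of runs. That is the qualitative behavior the abstract attributes to probabilistic retention; the actual method also re-scores tokens across steps, which this sketch omits.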

Inference & Quantization, Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
