Tsinghua AI · Mar 12, 2026 · arXiv:2603.12201

IndexCache: Accelerating Sparse Attention via Cross-Layer Index Reuse

Yushi Bai, Yu Bai, Qian Dong, Tingyu Jiang, Ting Jiang, Xin Lv, Zhengxiao Du, Aohan Zeng, Jie Tang, Juanzi Li

AI Summary

The paper introduces IndexCache, a method that accelerates sparse attention mechanisms such as DeepSeek Sparse Attention (DSA) by reusing top-k index selections across consecutive layers. IndexCache partitions layers into Full layers, which run their own indexers, and Shared layers, which reuse the indices of the nearest Full layer and so skip the redundant indexer computation. The authors propose two ways to choose and optimize the Full/Shared configuration: a training-free greedy search and a training-aware distillation approach. On a 30B DSA model, the method removes 75% of indexer computations with negligible quality degradation, reaching up to 1.82$\times$ prefill speedup and 1.48$\times$ decode speedup.
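The Full/Shared partition described above can be illustrated with a minimal sketch. It is not the authors' implementation: the names `top_k_indices`, `select_indices_per_layer`, and `indexer_scores` are hypothetical, the scores are toy values, and "nearest Full layer" is assumed here to mean the closest preceding Full layer.

```python
from typing import Callable, List, Sequence, Tuple


def top_k_indices(scores: Sequence[float], k: int) -> List[int]:
    """Return the positions of the k largest scores, in ascending position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def select_indices_per_layer(
    is_full: Sequence[bool],
    indexer_scores: Callable[[int], Sequence[float]],
    k: int,
) -> Tuple[List[List[int]], int]:
    """Run the indexer only on Full layers; Shared layers reuse the cached top-k.

    Assumption: a Shared layer reuses the closest *preceding* Full layer,
    so layer 0 has to be Full.
    """
    if not is_full or not is_full[0]:
        raise ValueError("layer 0 must be Full so every Shared layer has a source")
    cached: List[int] = []
    per_layer: List[List[int]] = []
    indexer_calls = 0
    for layer, full in enumerate(is_full):
        if full:
            # Full layer: compute fresh top-k indices and refresh the cache.
            cached = top_k_indices(indexer_scores(layer), k)
            indexer_calls += 1
        # Shared layer: falls through and reuses whatever is cached.
        per_layer.append(cached)
    return per_layer, indexer_calls


# Toy usage: 8 layers, one Full layer in every group of 4.
pattern = [True, False, False, False] * 2
toy_scores = lambda layer: [0.1, 0.9, 0.3, 0.7, 0.2 + layer]  # hypothetical indexer output
selected, calls = select_indices_per_layer(pattern, toy_scores, k=2)
print(f"{calls} of {len(pattern)} indexers ran")
```

With one Full layer in every four, only a quarter of the indexers run, which corresponds to the 75% reduction quoted for the 30B model.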

Key Contribution

Removes 75% of sparse-attention indexer computations with negligible quality degradation by reusing top-k indices across layers.

Abstract

Long-context agentic workflows have emerged as a defining use case for large language models, making attention efficiency critical for both inference speed and serving cost. Sparse attention addresses this challenge effectively, and DeepSeek Sparse Attention (DSA) is a representative production-grade solution: a lightweight lightning indexer selects the top-k most relevant tokens per query, reducing core attention from $O(L^2)$ to $O(Lk)$. However, the indexer itself retains $O(L^2)$ complexity and must run independently at every layer, despite the fact that the resulting top-k selections are highly similar across consecutive layers. We present IndexCache, which exploits this cross-layer redundancy by partitioning layers into a small set of Full layers that run their own indexers and a majority of Shared layers that simply reuse the nearest Full layer's top-k indices. We propose two complementary approaches to determine and optimize this configuration. Training-free IndexCache applies a greedy search algorithm that selects which layers to retain indexers by directly minimizing language modeling loss on a calibration set, requiring no weight updates. Training-aware IndexCache introduces a multi-layer distillation loss that trains each retained indexer against the averaged attention distributions of all layers it serves, enabling even simple interleaved patterns to match full-indexer accuracy. Experimental results on a 30B DSA model show that IndexCache can remove 75% of indexer computations with negligible quality degradation, achieving up to 1.82$\times$ prefill speedup and 1.48$\times$ decode speedup compared to standard DSA. These positive results are further confirmed by our preliminary experiments on the production-scale GLM-5 model (Figure 1).
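The training-free variant can be pictured as a greedy loop over layer configurations. The sketch below is one plausible reading, not the paper's exact procedure: it assumes the search starts with every layer Full and repeatedly converts to Shared the layer whose conversion gives the lowest calibration loss. The function `greedy_select_full_layers` and the callable `calibration_loss` are hypothetical, and the toy loss at the end stands in for a real language modeling loss on a calibration set.

```python
from typing import Callable, List


def greedy_select_full_layers(
    num_layers: int,
    num_full: int,
    calibration_loss: Callable[[List[bool]], float],
) -> List[bool]:
    """Greedily pick which layers keep their indexer (True = Full, False = Shared)."""
    if not 1 <= num_full <= num_layers:
        raise ValueError("num_full must be between 1 and num_layers")
    pattern = [True] * num_layers  # start from standard DSA: every layer is Full
    while sum(pattern) > num_full:
        best_layer, best_loss = -1, float("inf")
        # Layer 0 stays Full (assumption) so every Shared layer has a source.
        for layer in range(1, num_layers):
            if not pattern[layer]:
                continue
            trial = pattern.copy()
            trial[layer] = False  # tentatively drop this layer's indexer
            loss = calibration_loss(trial)
            if loss < best_loss:
                best_layer, best_loss = layer, loss
        pattern[best_layer] = False  # commit the least harmful removal
    return pattern


# Toy stand-in for the calibration loss: each Shared layer adds a fixed penalty.
penalty = [9, 1, 5, 2, 8, 1, 3, 2]
toy_loss = lambda p: float(sum(c for c, full in zip(penalty, p) if not full))

result = greedy_select_full_layers(num_layers=8, num_full=2, calibration_loss=toy_loss)
print(result)
```

Each greedy step evaluates the loss once per remaining Full layer, so the cost of the search is dominated by forward passes over the calibration set; no weights are updated.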

Architecture Design (Transformers, SSMs, MoE) · Distributed Systems & Hardware · Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 55

Year: 2026

Venue: N/A
