This paper introduces cross-layer sparse attention (CLSA), an approach to long-context LLM inference built on KV-sharing architectures such as YOCO. CLSA shares the routing index, in addition to the KV cache, across cross-decoder layers: a single indexer computes token-level top-k selection once, and every layer reuses the resulting index. This keeps the fine-grained selectivity of token sparse attention while amortizing the routing cost. Experiments on short-context and long-context benchmarks report up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context, addressing the efficiency-quality trade-off of existing sparse attention methods.
CLSA reports up to 7.6x faster decoding and 17.1x higher overall throughput at 128K context while remaining accurate on short-context and long-context benchmarks.
Long-context inference in modern LLMs is increasingly constrained by decoding efficiency, especially in reasoning-heavy settings where models generate long intermediate chains of thought. Existing sparse attention methods often face a practical efficiency-quality trade-off. Structured block sparse methods typically provide stronger acceleration but incur noticeable quality loss, while token sparse methods are usually more accurate yet deliver limited end-to-end speedup because top-k routing over the full cache remains expensive. In this work, we propose cross-layer sparse attention (CLSA), which is built on top of KV-sharing architectures such as YOCO. The core idea is to share not only the KV cache across cross-decoder layers, but also the routing index. A single indexer computes token-level top-k selection once and reuses the resulting index across layers, thereby preserving the fine-grained selectivity of token sparse attention while amortizing the routing overhead. The resulting architecture jointly eases all major inference bottlenecks, including pre-filling, KV-cache storage, and long-context decoding. Experiments across short-context and long-context benchmarks show that CLSA is both accurate and efficient, achieving up to 7.6x decoding speedup and 17.1x overall throughput improvement at 128K context. These results suggest a more complete architectural solution for long-context LLMs that jointly advances model quality and inference efficiency.
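The index-sharing idea can be illustrated with a minimal sketch in plain Python. It is not the paper's implementation: the dot-product indexer, the per-layer queries, and every function name below are assumptions chosen for illustration, and real systems would use batched tensor kernels. The sketch only shows one routing pass over a shared KV cache feeding several layers.

```python
import heapq
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def topk_index(index_query, index_keys, k):
    """Hypothetical indexer: score every cached token once, keep the k best.

    Dot-product scoring is an assumption; the abstract does not specify it.
    """
    scores = [dot(index_query, key) for key in index_keys]
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return sorted(best)  # cache positions in ascending order


def sparse_attention(query, keys, values, index):
    """Softmax attention restricted to the cache positions in `index`."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in index]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    out = [0.0] * len(values[0])
    for weight, i in zip(weights, index):
        for d, v in enumerate(values[i]):
            out[d] += (weight / total) * v
    return out


def decode_step(layer_queries, index_query, keys, values, k):
    """One decoding step: route once, then reuse the index in every layer.

    `keys` and `values` stand in for the KV cache shared by all layers.
    """
    index = topk_index(index_query, keys, k)  # single pass over the cache
    outputs = [sparse_attention(q, keys, values, index) for q in layer_queries]
    return index, outputs


def routing_score_count(num_layers, cache_len, shared):
    """Token scores computed per step by a full-cache indexer (illustrative)."""
    return cache_len if shared else num_layers * cache_len
```

Under the assumption that the indexer scores every cached token, per-layer routing costs `num_layers * cache_len` score computations per decoding step, while a shared index costs `cache_len`. This is the amortization the abstract refers to; the attention itself still runs in each layer, but only over the `k` selected tokens.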