CMU ML | UMich | University of California | May 3, 2026 | arXiv:2605.01910

Stochastic Sparse Attention for Memory-Bound Inference

Kyle Lee, Corentin Delacour, Kevin Callahan-Coray, Kyle Jiang, Can Yaras, Samet Oymak, Tathagata Srimani, Kerem Y. Camsari

AI Summary

The paper introduces Stochastic Additive No-mulT Attention (SANTA), a method that sparsifies value-cache access in autoregressive decoding by sampling a subset of value vectors from the post-softmax distribution. This replaces value-stage multiply-accumulates with gather-and-add operations. Using stratified-sampling variants, experiments show a 1.5x decode-step attention kernel speedup over FlashInfer and FlashDecoding on an NVIDIA RTX 6000 Ada while matching baseline accuracy at 32k-token contexts. The paper also proposes Bernoulli $qK^\mathsf{T}$ sampling for the score stage to further reduce key-feature access.

Key Contribution

In long-context decoding, attention is limited by memory bandwidth. SANTA reduces value-cache reads by stochastically sampling value vectors, achieving a 1.5x decode-step attention kernel speedup while matching baseline accuracy at 32k-token contexts.

Abstract

Autoregressive decoding becomes bandwidth-limited at long contexts, as generating each token requires reading all $n_k$ key and value vectors from the KV cache. We present Stochastic Additive No-mulT Attention (SANTA), a method that sparsifies value-cache access by sampling $S \ll n_k$ indices from the post-softmax distribution and aggregating only those value rows. This yields an unbiased estimator of the post-softmax value aggregation while replacing value-stage multiply-accumulates with gather-and-add. We introduce stratified sampling to design variance-reduced, GPU-friendly variants, demonstrating $1.5\times$ decode-step attention kernel speedup over FlashInfer and FlashDecoding on an NVIDIA RTX 6000 Ada while matching baseline accuracy at 32k-token contexts. Finally, we propose Bernoulli $qK^\mathsf{T}$ sampling as a complementary technique to sparsify the score stage, reducing key-feature access through stochastic ternary queries. Both methods are orthogonal to upstream techniques such as ternary quantization, low-rank projections, and KV-cache compression. Together, they point toward sparse, multiplier-free, and energy-efficient inference. We open-source our kernels at: https://github.com/OPUSLab/SANTA.git
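The value-stage idea in the abstract can be illustrated with a minimal NumPy sketch. It assumes plain sampling with replacement from the post-softmax distribution and a single query vector; the function names `exact_attention` and `sampled_value_aggregation` are hypothetical and are not taken from the released kernels, and the stratified, GPU-oriented variants described in the paper are not shown.

```python
import numpy as np

def exact_attention(q, K, V):
    """Dense single-query attention; returns the output and the softmax weights."""
    scores = K @ q / np.sqrt(q.shape[0])
    p = np.exp(scores - scores.max())  # subtract the max for numerical stability
    p /= p.sum()
    return p @ V, p

def sampled_value_aggregation(p, V, S, rng):
    """Estimate p @ V by reading only S sampled value rows (illustrative only).

    Assumption: indices are drawn i.i.d. with replacement from p. Since
    E[V[i]] = sum_i p[i] * V[i], the mean of the gathered rows is unbiased.
    """
    idx = rng.choice(p.shape[0], size=S, replace=True, p=p)
    # Gather-and-add: the sampled rows are summed, with one final scaling by 1/S.
    return V[idx].sum(axis=0) / S

rng = np.random.default_rng(0)
n_k, d = 1024, 64
q = rng.standard_normal(d)
K = rng.standard_normal((n_k, d))
V = rng.standard_normal((n_k, d))

exact, p = exact_attention(q, K, V)
estimate = sampled_value_aggregation(p, V, S=64, rng=rng)
print("max abs error with S=64:", np.abs(estimate - exact).max())
```

In this sketch only 64 of the 1024 value rows are read per decode step. The error of a single estimate shrinks as `S` grows, which is the trade-off that the paper's variance-reduced stratified variants aim to improve.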

Architecture Design (Transformers, SSMs, MoE) | Inference & Quantization | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
