Mar 2, 2026 | arXiv:2603.01915

Fast Entropy Decoding for Sparse MVM on GPUs

Emil Schätzle, Tommaso Pegolotti, Markus Püschel

AI Summary

The paper introduces dtANS, a lossless entropy coding method based on asymmetric numeral systems (ANS), to compress sparse matrices for faster sparse matrix-vector multiplication (SpMVM) on GPUs. Applied to the CSR format, dtANS reduces matrix size relative to the smallest cuSPARSE format in almost all cases for matrices with at least 2^15 entries and at least 10 entries per row on average, by up to 11.77 times. The compression yields SpMVM speedups over cuSPARSE for the majority of matrices with at least 2^25 nonzero entries, and a limited experiment indicates an improvement over the AI-based multi-format AlphaSparse.
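To illustrate why entropy coding can shrink a CSR matrix, the following C++ sketch estimates the empirical (order-0) Shannon entropy of the column-index stream, which is the bits-per-symbol figure that ANS-style coders approach. It is not the dtANS algorithm. The per-row delta transform and the function names `delta_encode_columns` and `entropy_bits_per_symbol` are assumptions made for this example and are not taken from the paper.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

// Assumption for illustration: replace each column index by its distance to
// the previous index in the same row (the first index of a row stays absolute).
std::vector<std::int64_t> delta_encode_columns(
    const std::vector<std::size_t>& row_ptr,
    const std::vector<std::uint32_t>& col_idx) {
    assert(!row_ptr.empty() && row_ptr.back() == col_idx.size());
    std::vector<std::int64_t> deltas;
    deltas.reserve(col_idx.size());
    for (std::size_t r = 0; r + 1 < row_ptr.size(); ++r) {
        std::int64_t prev = 0;
        for (std::size_t k = row_ptr[r]; k < row_ptr[r + 1]; ++k) {
            deltas.push_back(static_cast<std::int64_t>(col_idx[k]) - prev);
            prev = col_idx[k];
        }
    }
    return deltas;
}

// Empirical order-0 Shannon entropy in bits per symbol. This ignores the
// cost of storing the symbol table, which a real coder also has to pay.
double entropy_bits_per_symbol(const std::vector<std::int64_t>& symbols) {
    if (symbols.empty()) return 0.0;
    std::map<std::int64_t, std::size_t> counts;
    for (std::int64_t s : symbols) ++counts[s];
    const double n = static_cast<double>(symbols.size());
    double h = 0.0;
    for (const auto& kv : counts) {
        const double p = static_cast<double>(kv.second) / n;
        h -= p * std::log2(p);
    }
    return h;
}
```

For two rows that each hold columns 0, 1, 2, 3, the raw indices have an entropy of 2 bits per symbol, while the deltas (0, 1, 1, 1 per row) come to about 0.81 bits per symbol. Regular sparsity patterns of this kind are where an entropy coder has room to compress.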

Key Contribution

Achieves up to 3.48x faster sparse matrix-vector multiplication on GPUs than cuSPARSE by compressing matrices with dtANS, a new entropy coding technique designed for fast parallel GPU decoding.

Abstract

We present a novel, practical approach to speed up sparse matrix-vector multiplication (SpMVM) on GPUs. The key idea is to apply lossless entropy coding to further compress the sparse matrix when stored in one of the commonly supported formats. Our method is based on dtANS, our new lossless compression method that improves the entropy coding technique of asymmetric numeral systems (ANS) specifically for fast parallel GPU decoding when used in tandem with SpMVM. We apply dtANS to the widely used CSR format and present extensive benchmarks on the SuiteSparse collection of matrices against the state-of-the-art cuSPARSE library. On matrices with at least 2^15 entries and at least 10 entries per row on average, our compression reduces the matrix size over the smallest cuSPARSE format (CSR, COO and SELL) in almost all cases, by up to 11.77 times. Further, we achieve an SpMVM speedup for the majority of matrices with at least 2^25 nonzero entries. The best speedup is 3.48x. We also show that we can improve over the AI-based multi-format AlphaSparse in an experiment that is limited due to its extreme computation overhead. We provide our code as an open source C++/CUDA header library, which includes both compression and multiplication kernels.
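For readers unfamiliar with the CSR format named in the abstract, here is a minimal sequential C++ sketch of the uncompressed SpMVM that compressed kernels have to reproduce. It is a CPU reference loop, not the authors' GPU kernel. The struct `CsrMatrix` and the function `spmvm_csr` are hypothetical names chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Plain CSR storage (illustrative layout, not the paper's data structure).
struct CsrMatrix {
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<std::size_t> row_ptr;  // rows + 1 offsets into col_idx/values
    std::vector<std::size_t> col_idx;  // column index of each nonzero
    std::vector<double> values;        // value of each nonzero
};

// Computes y = A * x with one pass over the nonzeros of each row.
std::vector<double> spmvm_csr(const CsrMatrix& a, const std::vector<double>& x) {
    assert(a.row_ptr.size() == a.rows + 1);
    assert(a.col_idx.size() == a.values.size());
    assert(x.size() == a.cols);
    std::vector<double> y(a.rows, 0.0);
    for (std::size_t r = 0; r < a.rows; ++r) {
        double sum = 0.0;
        for (std::size_t k = a.row_ptr[r]; k < a.row_ptr[r + 1]; ++k) {
            sum += a.values[k] * x[a.col_idx[k]];
        }
        y[r] = sum;
    }
    return y;
}
```

Each nonzero is read exactly once and contributes a single multiply-add, so the kernel does little arithmetic per byte loaded. That gives some intuition for how a smaller matrix representation could translate into a speedup, provided decoding is cheap enough.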

Distributed Systems & Hardware | Inference & Quantization | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 46

Year: 2026

Venue: N/A
