Apr 20, 2026 · arXiv:2604.17834

AsyncSparse: Accelerating Sparse Matrix-Matrix Multiplication on Asynchronous GPU Architectures

AI Summary

This paper introduces AsyncSparse, an approach that accelerates Sparse Matrix-Matrix Multiplication (SpMM) on modern GPUs by exploiting asynchronous features such as NVIDIA's Tensor Memory Accelerator (TMA) and warp specialization. The authors co-design two kernels: one for structured sparsity, built on a warp-specialized producer-consumer pipeline over the Block Compressed Sparse Row (BCSR) format, and one for irregular sparsity, built on a Window Compressed Sparse Row (WCSR) format. On SuiteSparse matrices, the WCSR kernel is 1.47x faster than AccSpMM. The BCSR kernel delivers a 2.66x end-to-end speedup over cuDNN/cuBLAS on Qwen2.5-7B prefill at 90% block sparsity with 64K tokens.

Key Contribution

Asynchronous GPU features such as NVIDIA's TMA and warp specialization can unlock SpMM speedups of up to 6.24x over cuSPARSE on SuiteSparse matrices, but only when the kernel and the sparse format are co-designed around them.

Abstract

Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental kernel across scientific computing and machine learning. While prior work accelerates SpMM using Tensor Cores, no existing sparse kernel exploits the asynchronous features of modern GPU architectures, such as NVIDIA's Tensor Memory Accelerator (TMA) and warp specialization. This work systematically studies how these features impact SpMM performance and introduces two co-designed kernels. For structured sparsity, we optimize a warp-specialized producer-consumer pipeline overlapping TMA data transfer with WGMMA computation using Block Compressed Sparse Row (BCSR) format. For irregular sparsity, we design a Window Compressed Sparse Row (WCSR) kernel that loads the sparse operand via TMA and splits large row-windows across thread blocks for load balancing. Our WCSR kernel outperforms all prior SpMM kernels on SuiteSparse matrices (1.47x over AccSpMM, 6.24x over cuSPARSE). Our BCSR kernel achieves a combined 2.66x end-to-end speedup on Qwen2.5-7B prefill at 90% block sparsity with 64K tokens over cuDNN/cuBLAS.
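To make the two ideas in the abstract concrete, here is a minimal CPU reference sketch in pure Python. It is not the paper's GPU kernel: it uses no TMA, WGMMA, or warp specialization, and it only illustrates (a) what a generic BCSR layout stores and how SpMM walks it, and (b) one plausible way to split large row-windows into bounded work items for load balancing. The function names `dense_to_bcsr`, `bcsr_spmm`, and `split_windows`, the square block size `bs`, and the `max_chunk` parameter are all assumptions made for illustration; the exact WCSR layout and splitting policy are not described in the abstract.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def dense_to_bcsr(a: Matrix, bs: int):
    """Convert a dense matrix to a generic BCSR layout with bs x bs blocks.

    Returns (row_ptr, col_idx, blocks). Only blocks that contain at least
    one nonzero are stored; row_ptr[i]:row_ptr[i+1] indexes the stored
    blocks of block-row i.
    """
    rows, cols = len(a), len(a[0])
    assert rows % bs == 0 and cols % bs == 0, "dimensions must be multiples of bs"
    row_ptr, col_idx, blocks = [0], [], []
    for bi in range(rows // bs):
        for bj in range(cols // bs):
            block = [row[bj * bs:(bj + 1) * bs] for row in a[bi * bs:(bi + 1) * bs]]
            if any(v != 0 for r in block for v in r):
                col_idx.append(bj)
                blocks.append(block)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, blocks


def bcsr_spmm(row_ptr, col_idx, blocks, b: Matrix, bs: int) -> Matrix:
    """Compute C = A @ B where A is given in BCSR form and B is dense."""
    n = len(b[0])
    c = [[0.0] * n for _ in range((len(row_ptr) - 1) * bs)]
    for bi in range(len(row_ptr) - 1):
        # Visit only the stored (nonzero) blocks of this block-row.
        for k in range(row_ptr[bi], row_ptr[bi + 1]):
            bj, block = col_idx[k], blocks[k]
            for i in range(bs):
                for j in range(bs):
                    v = block[i][j]
                    for col in range(n):
                        c[bi * bs + i][col] += v * b[bj * bs + j][col]
    return c


def split_windows(window_sizes: List[int], max_chunk: int) -> List[Tuple[int, int, int]]:
    """Split each row-window into work items of at most max_chunk columns.

    window_sizes[w] is the number of nonzero columns in row-window w
    (an assumed summary of the sparse structure). Each returned item is
    (window_id, start, end), a half-open column range inside that window.
    """
    assert max_chunk > 0
    items = []
    for w, size in enumerate(window_sizes):
        for start in range(0, size, max_chunk):
            items.append((w, start, min(start + max_chunk, size)))
    return items


# Small example: a 4x4 matrix with two nonzero 2x2 blocks.
A = [[1, 2, 0, 0],
     [3, 4, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 5, 6]]
B = [[1, 0], [0, 1], [1, 1], [2, 0]]
row_ptr, col_idx, blocks = dense_to_bcsr(A, bs=2)
C = bcsr_spmm(row_ptr, col_idx, blocks, B, bs=2)
work_items = split_windows([5, 0, 12], max_chunk=4)
```

In this sketch, a window with 12 nonzero columns becomes three work items of 4 columns each, so no single work item is much larger than the others. When one window is split across several workers, their partial results for the same output rows have to be combined afterwards; how AsyncSparse does that is not stated in the abstract.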

Architecture Design (Transformers, SSMs, MoE) · Distributed Systems & Hardware · Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 48

Year: 2026

Venue: N/A
