This paper introduces AsyncSparse, an approach to accelerating Sparse Matrix-Matrix Multiplication (SpMM) on modern GPUs by exploiting asynchronous features such as NVIDIA's Tensor Memory Accelerator (TMA) and warp specialization. The authors co-design two kernels: one for structured sparsity that uses a warp-specialized producer-consumer pipeline over the Block Compressed Sparse Row (BCSR) format, and one for irregular sparsity that uses a Window Compressed Sparse Row (WCSR) format. On SuiteSparse matrices, the WCSR kernel is 1.47x faster than AccSpMM and 6.24x faster than cuSPARSE; the BCSR kernel yields a 2.66x end-to-end speedup over cuDNN/cuBLAS on Qwen2.5-7B prefill at 90% block sparsity with 64K tokens.
Asynchronous GPU features such as NVIDIA's TMA can unlock speedups of up to 6.24x over cuSPARSE in sparse matrix multiplication, but only with careful kernel co-design.
Sparse Matrix-Matrix Multiplication (SpMM) is a fundamental kernel across scientific computing and machine learning. While prior work accelerates SpMM using Tensor Cores, no existing sparse kernel exploits the asynchronous features of modern GPU architectures, such as NVIDIA's Tensor Memory Accelerator (TMA) and warp specialization. This work systematically studies how these features impact SpMM performance and introduces two co-designed kernels. For structured sparsity, we optimize a warp-specialized producer-consumer pipeline overlapping TMA data transfer with WGMMA computation using Block Compressed Sparse Row (BCSR) format. For irregular sparsity, we design a Window Compressed Sparse Row (WCSR) kernel that loads the sparse operand via TMA and splits large row-windows across thread blocks for load balancing. Our WCSR kernel outperforms all prior SpMM kernels on SuiteSparse matrices (1.47x over AccSpMM, 6.24x over cuSPARSE). Our BCSR kernel achieves a combined 2.66x end-to-end speedup on Qwen2.5-7B prefill at 90% block sparsity with 64K tokens over cuDNN/cuBLAS.
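To make the BCSR layout named in the abstract concrete, here is a minimal CPU reference sketch in pure Python. It is not the paper's kernel: it shows only the standard BCSR data structure (block-row pointers, block-column indices, dense blocks) and the SpMM loop over stored blocks. The function names `dense_to_bcsr` and `bcsr_spmm` and the square block size `bs` are assumptions made for illustration; the TMA transfers, WGMMA instructions, and warp specialization described above have no counterpart here.

```python
def dense_to_bcsr(A, bs):
    """Convert a dense matrix (list of lists) to BCSR with bs x bs blocks.

    Returns (row_ptr, col_idx, blocks): block row i owns the stored blocks
    with positions row_ptr[i] .. row_ptr[i + 1] - 1.
    """
    n_rows, n_cols = len(A), len(A[0])
    assert n_rows % bs == 0 and n_cols % bs == 0, "dims must be multiples of bs"
    row_ptr, col_idx, blocks = [0], [], []
    for bi in range(n_rows // bs):
        for bj in range(n_cols // bs):
            block = [A[bi * bs + r][bj * bs:(bj + 1) * bs] for r in range(bs)]
            # Store a block only if it holds at least one nonzero.
            if any(v != 0 for row in block for v in row):
                col_idx.append(bj)
                blocks.append(block)
        row_ptr.append(len(col_idx))
    return row_ptr, col_idx, blocks


def bcsr_spmm(row_ptr, col_idx, blocks, B, bs):
    """Compute C = A @ B, where A is given in BCSR and B is dense."""
    n_block_rows = len(row_ptr) - 1
    n_out_cols = len(B[0])
    C = [[0] * n_out_cols for _ in range(n_block_rows * bs)]
    for bi in range(n_block_rows):
        # Visit only the stored (nonzero) blocks of this block row.
        for k in range(row_ptr[bi], row_ptr[bi + 1]):
            bj = col_idx[k]
            block = blocks[k]
            # Small dense block product: the part a GPU kernel would
            # hand to Tensor Cores.
            for r in range(bs):
                out = C[bi * bs + r]
                for c in range(bs):
                    a = block[r][c]
                    b_row = B[bj * bs + c]
                    for j in range(n_out_cols):
                        out[j] += a * b_row[j]
    return C


# Example: a 4x4 matrix with two nonzero 2x2 blocks out of four.
A = [[1, 2, 0, 0],
     [3, 4, 0, 0],
     [0, 0, 0, 0],
     [0, 0, 5, 6]]
B = [[1, 0],
     [0, 1],
     [1, 1],
     [2, 0]]
row_ptr, col_idx, blocks = dense_to_bcsr(A, 2)
C = bcsr_spmm(row_ptr, col_idx, blocks, B, 2)
```

In this example half of the blocks are empty and are never touched by `bcsr_spmm`. Skipping whole blocks is what lets a block-sparse kernel save both memory traffic and arithmetic, and the savings grow as block sparsity increases.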