This paper evaluates two sparse matrix multiplication kernels, SpMM and SDDMM, on the Cerebras CS-3 wafer-scale AI accelerator. The authors design low-level kernels for both operations and optimize them for I/O performance, memory footprint, and scalability to large matrices. In their evaluation, the CS-3 outperforms CPU by up to 100x for SpMM and 20x for SDDMM on 90% sparse matrices, but beyond 99% sparsity SpMM on the CS-3 becomes slower than on CPU.
Cerebras CS-3 beats CPUs by up to 100x on sparse matrix multiplication, but only up to a point: beyond 99% sparsity, SpMM on the CS-3 falls behind the CPU.
In recent years, novel AI accelerators have emerged as promising alternatives to GPUs for AI model training and inference tasks. One such accelerator, the Cerebras CS-3, achieves strong performance on large model training as well as scientific applications like molecular dynamics simulations. While dense compute workloads have been thoroughly explored for the CS-3, its potential for sparse workloads has not been fully examined. Applications requiring sparse linear algebra kernels, such as GNNs, linear solvers, and recommendation systems, could achieve good performance on a dataflow accelerator like the CS-3. In this work, we explore two key sparse linear algebra kernels, sparse-dense matrix multiplication (SpMM) and sampled dense-dense matrix multiplication (SDDMM), on the Cerebras CS-3. We propose low-level CS-3 kernel designs for these operations and optimize our designs to improve I/O performance, memory footprint, and scalability to large matrices. Our evaluation examines memory footprint and SpMM/SDDMM speedup relative to CPU. The evaluation suggests that the CS-3 can outperform CPU by 100× for SpMM with 90% sparse matrices, with performance improving as sparse matrix dimensionality increases. SDDMM on the CS-3 can outperform CPU by 20× for 90% sparse matrices. We additionally find that as sparsity increases beyond 99%, the CS-3 suffers from performance degradation that makes it slower than CPU for SpMM.
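For readers unfamiliar with the two kernels named in the abstract, the following is a minimal reference sketch of what SpMM and SDDMM compute. It is plain Python on the CPU, not the paper's CS-3 kernel design. The CSR storage layout, the function names (`dense_to_csr`, `spmm`, `sddmm`), and the SDDMM convention used here (scale each sampled dot product by the sparse matrix's stored value) are assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed storage format: CSR as (indptr, indices, data).
CSR = Tuple[List[int], List[int], List[float]]
Dense = List[List[float]]


def dense_to_csr(a: Dense) -> CSR:
    """Convert a dense row-major matrix to CSR, dropping exact zeros."""
    indptr, indices, data = [0], [], []
    for row in a:
        for j, v in enumerate(row):
            if v != 0:
                indices.append(j)
                data.append(v)
        indptr.append(len(indices))
    return indptr, indices, data


def spmm(s: CSR, b: Dense) -> Dense:
    """SpMM: C = S @ B, with S sparse (CSR) and B dense."""
    indptr, indices, data = s
    n_rows, n_cols = len(indptr) - 1, len(b[0])
    c = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        # Only the stored nonzeros of row i contribute to row i of C.
        for p in range(indptr[i], indptr[i + 1]):
            k, v = indices[p], data[p]
            for j in range(n_cols):
                c[i][j] += v * b[k][j]
    return c


def sddmm(s: CSR, a: Dense, b: Dense) -> CSR:
    """SDDMM: S * (A @ B) elementwise, evaluated only at the nonzeros of S."""
    indptr, indices, data = s
    inner = len(b)
    out = []
    for i in range(len(indptr) - 1):
        for p in range(indptr[i], indptr[i + 1]):
            j = indices[p]
            # One dot product per stored nonzero; the rest of A @ B is skipped.
            dot = sum(a[i][k] * b[k][j] for k in range(inner))
            out.append(data[p] * dot)
    # The output reuses the sparsity pattern of S.
    return indptr, indices, out
```

In both loops the work is proportional to the number of stored nonzeros rather than to the full matrix size, so the amount of arithmetic shrinks as sparsity grows.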