This paper introduces SISA, a systolic array architecture that partitions a traditional square array into horizontal rectangular slabs that can be scheduled independently. The design targets the underutilization of processing elements (PEs) caused by the input-dependent, skewed matrices in LLMs: small or skewed matrix shapes run in parallel across slabs, while large GEMMs still use the full array. Experimental results show up to 8.52x speedup and 93% energy-delay-product (EDP) reduction on representative LLMs compared to a monolithic SA with the same number of PEs.
LLMs' skewed matrix shapes need not hamstring systolic array performance: SISA's partitioned architecture achieves up to 8.52x speedup and 93% EDP reduction compared to monolithic arrays.
The currently dominant AI/ML workloads, such as Large Language Models (LLMs), rely on the efficient execution of General Matrix-Matrix Multiplication (GEMM) operations. Thus, most systems are equipped with dedicated matrix hardware accelerators based on square Systolic Arrays (SAs) of Processing Elements (PEs). While this organization was effective for traditional Deep Neural Networks (DNNs), LLMs introduce input-dependent and highly skewed matrices, leading to underutilized SA resources. To address this challenge, we propose SISA (Scale-In Systolic Array), a novel SA architecture that partitions the traditional square array into horizontal rectangular slabs. With minimal overhead, SISA exposes parallelism through independently scheduled slabs for efficient execution of small or skewed matrix shapes, while retaining full-array operation for large GEMMs. SISA achieves up to 8.52x speedup and 93% energy-delay-product (EDP) reduction for representative LLMs compared to a state-of-the-art monolithic SA with the same number of PEs.
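To make the underutilization argument concrete, the following is a minimal sketch of a toy utilization model. It is not the paper's scheduler or evaluation methodology. It assumes an output-stationary mapping in which the GEMM's M dimension maps to PE rows and N to PE columns, ignores the K dimension and pipeline fill/drain, and assumes each slab processes one output tile per round. The function names, the 128x128 array size, and the 16-slab split are hypothetical choices made for illustration only.

```python
from math import ceil


def tile_count(m, n, rows, cols):
    """Output tiles needed to cover an m x n result on a rows x cols PE block."""
    return ceil(m / rows) * ceil(n / cols)


def utilization(m, n, rows, cols, slabs=1):
    """Fraction of PE slots doing useful work in this toy model.

    The rows x cols array is split into `slabs` horizontal slabs of
    rows // slabs rows each (assumes rows is divisible by slabs).
    slabs=1 models a monolithic array.
    """
    slab_rows = rows // slabs
    tiles = tile_count(m, n, slab_rows, cols)
    rounds = ceil(tiles / slabs)  # every slab takes one tile per round
    return (m * n) / (rounds * rows * cols)


# Skewed GEMM (8 x 4096 output), hypothetical 128x128 array.
print(utilization(8, 4096, 128, 128))            # 0.0625
print(utilization(8, 4096, 128, 128, slabs=16))  # 1.0
# Large GEMM on the full array.
print(utilization(4096, 4096, 128, 128))         # 1.0
```

In this simplified model the monolithic array leaves 120 of its 128 rows idle on the skewed shape, while sixteen 8-row slabs keep every PE busy. The large GEMM already fills the full array, which is consistent with retaining full-array operation for that case. These figures describe the toy model only and are not comparable to the speedup and EDP results reported above.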