This paper introduces MatrixFlow, a 16x16 systolic-array accelerator, together with a co-design methodology built on Gem5-AcceSys that optimizes transformer inference by explicitly overlapping computation and data transfer. The design relies on paged streaming dataflows and a page-aligned block matrix multiplication method that uses 4 KB tiles and an on-chip buffer of about 20 KB. gem5 simulations of BERT and ViT show up to 22x end-to-end speedup over a CPU-only baseline and 5x to 8x over existing loosely and tightly coupled accelerators, indicating that efficient data streaming matters more than large SRAMs for transformer inference.
Forget massive SRAMs: this work shows that clever data streaming and compute/transfer overlap can yield 22x speedups for transformer inference, even with standard PCIe interconnects.
Transformers have revolutionized AI in natural language processing and computer vision, but their large computation and memory demands pose major challenges for hardware acceleration. In practice, end-to-end throughput is often limited by paged data movement and interconnect bandwidth rather than raw MAC count. This work proposes a unified system-accelerator co-design approach for transformer inference that jointly optimizes a matrix accelerator and its system integration through paged streaming dataflows and explicit overlap of compute and transfer. On the hardware side, we introduce MatrixFlow, a loosely coupled 16x16 systolic-array accelerator with a page-aligned block matrix multiplication method using 4 KB tiles, a small on-chip buffer of about 20 KB, and a pipelined schedule of DMA, compute, and DMA-out to utilize interconnect bandwidth efficiently. On the system side, we develop Gem5-AcceSys, an extension of the gem5 full-system simulator that explores standard interconnects such as PCIe and configurable memory hierarchies including Direct Memory, Direct Cache, and Device Memory modes with SMMU/TLB effects. We evaluate the co-design using gem5 simulations on representative transformer models including BERT and ViT across multiple data types and system setups. Results show up to 22x end-to-end speedup over a CPU-only baseline and 5x to 8x gains over state-of-the-art loosely and tightly coupled accelerators. We further show that a standard PCIe-based host-memory design can achieve about 80 percent of the performance of on-device HBM. Overall, paged streaming and pipeline overlap, rather than large local SRAMs, are the most effective levers for efficient transformer inference under realistic system constraints.
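The two ideas in the abstract, page-sized tiling and a DMA-in / compute / DMA-out pipeline, can be illustrated with a minimal Python sketch. It is not MatrixFlow's implementation: the square tile shape, the helper names (`tile_edge`, `blocked_matmul`, `pipeline_time`), and the idealized timing model with fixed per-tile stage times are all assumptions made for illustration. Only the 4 KB tile size and the three pipeline stages come from the abstract.

```python
import math

PAGE_BYTES = 4096  # 4 KB tile size stated in the abstract


def tile_edge(elem_bytes: int) -> int:
    """Largest square tile edge whose footprint fits in one page.

    Assumption: tiles are square. The abstract only gives the 4 KB size.
    """
    return math.isqrt(PAGE_BYTES // elem_bytes)


def blocked_matmul(a, b, elem_bytes=4):
    """Multiply two list-of-lists matrices one tile pair at a time."""
    n, k, m = len(a), len(b), len(b[0])
    t = tile_edge(elem_bytes)
    c = [[0] * m for _ in range(n)]
    for i0 in range(0, n, t):
        for j0 in range(0, m, t):
            for k0 in range(0, k, t):
                # One (A tile, B tile) pair: the unit of work that a single
                # DMA-in plus compute step would cover in a streaming design.
                for i in range(i0, min(i0 + t, n)):
                    row_a, row_c = a[i], c[i]
                    for kk in range(k0, min(k0 + t, k)):
                        a_ik, row_b = row_a[kk], b[kk]
                        for j in range(j0, min(j0 + t, m)):
                            row_c[j] += a_ik * row_b[j]
    return c


def pipeline_time(num_tiles, t_in, t_compute, t_out, overlapped=True):
    """Idealized latency of a 3-stage tile pipeline (hypothetical time units).

    Assumes every tile has the same stage times and that buffering never
    stalls a stage; real systems add SMMU/TLB and interconnect effects.
    """
    if num_tiles == 0:
        return 0
    stages = (t_in, t_compute, t_out)
    if not overlapped:
        return num_tiles * sum(stages)
    # Fill the pipeline once, then one tile completes per bottleneck interval.
    return sum(stages) + (num_tiles - 1) * max(stages)
```

Under these assumptions, a 4-byte element type gives 32x32 tiles and a 1-byte type gives 64x64 tiles. With 10 tiles and stage times of 2, 3, and 1 units, `pipeline_time` returns 60 without overlap and 33 with overlap, which shows why the slowest stage, rather than the sum of stages, sets steady-state throughput once transfers and compute run concurrently.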