This paper introduces O-POPE, an outer-product engine designed to accelerate general matrix multiply (GEMM) operations in machine learning workloads while minimizing buffering overhead. By repurposing floating-point unit (FPU) pipeline registers as buffers, O-POPE reaches up to 99.97% FPU utilization and operates at 1 GHz in 12 nm FinFET technology. Compared with state-of-the-art floating-point GEMM accelerators, it improves performance by 1.33x, performance density by 9%, and energy efficiency by 8%.
O-POPE reaches up to 99.97% FPU utilization at 1 GHz, combining high arithmetic utilization with a fast operating frequency and low buffering overhead for GEMM in ML hardware.
General matrix multiply (GEMM) dominates both the execution time and the energy consumption of modern machine learning (ML) workloads, placing increasing pressure on hardware efficiency. While quantization mitigates computational and data movement costs, accuracy-sensitive tasks such as training still require higher-precision floating-point formats. Existing floating-point GEMM accelerators face trade-offs between operating frequency, arithmetic utilization, and buffering overhead. This work presents O-POPE, a scalable outer-product engine that simultaneously achieves high utilization, low overhead, and a fast operating frequency by repurposing floating-point unit (FPU) pipeline registers as buffers. This solution leverages the data-reuse advantages of output-stationary outer-product execution and enables 1 GHz (0.72 V) operation in 12 nm FinFET technology with less than 2% buffer area for a 2048-MAC configuration. Our evaluation shows that O-POPE achieves up to 99.97% FPU utilization and improves performance by 1.33x, performance density by 9%, and energy efficiency by 8% compared to state-of-the-art floating-point GEMM accelerators.
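The data-reuse argument behind output-stationary outer-product execution can be illustrated with a small software sketch. This is not O-POPE's hardware design: the function names `outer_product_gemm` and `macs_per_operand` are hypothetical, the tile shape is arbitrary, and the FPU pipeline-register buffering described in the abstract is not modeled. The sketch only shows that each step fetches one column of A and one row of B (M + N operands) and performs M x N multiply-accumulates into accumulators that never move.

```python
def outer_product_gemm(a, b):
    """Compute C = A @ B as a sum of rank-1 (outer-product) updates.

    a is M x K and b is K x N, both given as lists of lists. The accumulator
    tile c stays in place (output-stationary) while one column of A and one
    row of B are streamed in per step k. Illustrative only.
    """
    m, k_dim, n = len(a), len(b), len(b[0])
    if any(len(row) != k_dim for row in a):
        raise ValueError("inner dimensions of A and B do not match")
    c = [[0.0] * n for _ in range(m)]  # stationary output accumulators
    for k in range(k_dim):
        col_a = [a[i][k] for i in range(m)]  # M operands fetched this step
        row_b = b[k]                         # N operands fetched this step
        for i in range(m):
            for j in range(n):
                # M*N multiply-accumulates reuse the M+N fetched operands
                c[i][j] += col_a[i] * row_b[j]
    return c


def macs_per_operand(m, n):
    """MACs performed per input operand fetched in one outer-product step."""
    return (m * n) / (m + n)
```

For example, `outer_product_gemm([[1, 2], [3, 4]], [[5, 6], [7, 8]])` accumulates two rank-1 updates into a 2 x 2 tile, and `macs_per_operand(m, n)` grows with the tile size, which is why larger output-stationary tiles need proportionally less input bandwidth per MAC.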