The paper introduces VMXDOTP, a RISC-V Vector ISA extension that accelerates dot products on microscaling (MX) data formats in vector processing elements (VPEs). VMXDOTP supports MXFP8 and MXFP4 inputs with FP32 and BF16 accumulation, addressing the irregular vector pipelines that MX semantics otherwise cause. Implemented in 12 nm FinFET, a VMXDOTP-enhanced VPE cluster reaches up to 97% utilization on MX-MatMul and delivers up to 7.0x speedup and 4.9x higher energy efficiency than software-emulated MXFP8-MatMul.
Get up to 7.0x speedup and 4.9x higher energy efficiency over software-emulated MXFP8-MatMul with VMXDOTP, a RISC-V Vector ISA extension that makes microscaling data formats practical for vector processors.
Compared to the first generation of deep neural networks, dominated by regular, compute-intensive kernels such as matrix multiplications (MatMuls) and convolutions, modern decoder-based transformers interleave attention, normalization, and data-dependent control flow. This demands flexible accelerators, a requirement met by scalable, highly energy-efficient shared-L1-memory vector processing element (VPE) clusters. Meanwhile, the ever-growing size and bandwidth needs of state-of-the-art models make reduced-precision formats increasingly attractive. Microscaling (MX) data formats, based on block floating-point (BFP) representations, have emerged as a promising solution to reduce data volumes while preserving accuracy. However, MX semantics are poorly aligned with vector execution: block scaling and multi-step mixed-precision operations break the regularity of vector pipelines, leading to underutilized compute resources and performance degradation. To address these challenges, we propose VMXDOTP, a RISC-V Vector (RVV) 1.0 instruction set architecture (ISA) extension for efficient MX dot product execution, supporting MXFP8 and MXFP4 inputs, FP32 and BF16 accumulation, and software-defined block sizes. A VMXDOTP-enhanced VPE cluster achieves up to 97 % utilization on MX-MatMul. Implemented in 12 nm FinFET, it achieves up to 125 MXFP8-GFLOPS and 250 MXFP4-GFLOPS, with 843/1632 MXFP8/MXFP4-GFLOPS/W at 1 GHz, 0.8 V, and only 7.2 % area overhead. Our design yields up to 7.0x speedup and 4.9x higher energy efficiency than software-emulated MXFP8-MatMul. Compared with prior MX engines, VMXDOTP supports variable block sizes, is up to 1.4x more area-efficient, and delivers up to 2.1x higher energy efficiency.
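To make the "block scaling and multi-step mixed-precision operations" concrete, the following is a minimal software reference sketch of an MX-style dot product. It is not the VMXDOTP instruction semantics from the paper. It assumes an OCP-MX-style layout in which each block of elements shares one power-of-two scale, and it accumulates block results in FP32. The function names (`to_fp32`, `split_blocks`, `mx_dot`), the `emax_elem=8` default (an E4M3-like element), and the omission of element rounding to FP8/FP4 are all assumptions made for illustration.

```python
import math
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (FP64) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def split_blocks(values, block_size=32, emax_elem=8):
    """Split a vector into blocks that each share one power-of-two scale.

    Returns (elements, shared_exps). emax_elem=8 is an assumption matching an
    E4M3-like element; rounding the elements to FP8/FP4 is omitted here.
    """
    if len(values) % block_size:
        raise ValueError("vector length must be a multiple of block_size")
    elements, shared_exps = [], []
    for lo in range(0, len(values), block_size):
        block = values[lo:lo + block_size]
        amax = max(abs(v) for v in block)
        # frexp gives amax = m * 2**e with 0.5 <= m < 1,
        # so floor(log2(amax)) equals e - 1.
        exp = (math.frexp(amax)[1] - 1 - emax_elem) if amax else 0
        shared_exps.append(exp)
        elements.extend(math.ldexp(v, -exp) for v in block)
    return elements, shared_exps


def mx_dot(a_elems, a_exps, b_elems, b_exps, block_size=32):
    """Dot product of two block-scaled vectors with FP32 accumulation."""
    if len(a_elems) != len(b_elems) or len(a_elems) % block_size:
        raise ValueError("inputs must have equal length, a multiple of block_size")
    acc = 0.0
    for blk, lo in enumerate(range(0, len(a_elems), block_size)):
        hi = lo + block_size
        # Step 1: multiply-add the unscaled elements of one block.
        partial = sum(a * b for a, b in zip(a_elems[lo:hi], b_elems[lo:hi]))
        # Step 2: apply both shared scales as a single exponent addition.
        scaled = math.ldexp(partial, a_exps[blk] + b_exps[blk])
        # Step 3: accumulate across blocks, rounding to FP32 each time.
        acc = to_fp32(acc + scaled)
    return acc


# Usage with a small, software-chosen block size of 2.
a = [1.0, 2.0, 0.5, 4.0]
b = [2.0, 1.0, 4.0, 0.25]
a_elems, a_exps = split_blocks(a, block_size=2)
b_elems, b_exps = split_blocks(b, block_size=2)
result = mx_dot(a_elems, a_exps, b_elems, b_exps, block_size=2)
```

The three steps inside the loop are the per-block work that a scalar or plain-vector software emulation has to issue as separate operations, which is the irregularity the abstract describes. The `block_size` parameter mirrors the software-defined block sizes mentioned above.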