Bologna | Samsung Electronics | Apr 30, 2026 | arXiv:2604.27808

AME-PIM: Can Memory be Your Next Tensor Accelerator?

Emanuele Venieri, Simone Manoni, Alberto Florian, Jaehyun Park, Kyomin Sohn, Andrea Bartolini

AI Summary

This paper examines whether High Bandwidth Memory with Processing-in-Memory (HBM-PIM) can serve as a backend for ISA-level matrix acceleration, using the RISC-V Attached Matrix Extension (AME) as the semantic reference. The authors propose a PEP-based execution model that maps AME instructions to HBM-PIM micro-kernels, and they introduce a reduction-free outer-product dataflow that keeps accumulation entirely within memory. Experiments on Samsung Aquabolt-XL show that AME matrix tile multiplication achieves up to 14.9 GFLOP/s on a single HBM pseudo-channel.
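To make the phrase "reduction-free outer-product dataflow" concrete, the following minimal sketch contrasts the two standard ways of ordering a matrix multiplication. It is a generic illustration in plain Python, not the paper's mapping of AME tiles onto HBM-PIM units; the function names `gemm_outer_product` and `gemm_inner_product` are hypothetical and chosen only for this example.

```python
def gemm_outer_product(A, B):
    """Compute C = A @ B as a sum of rank-1 (outer-product) updates.

    Illustrative only: every output element receives one independent
    multiply-accumulate per step, so no sum across lanes is ever needed.
    """
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for p in range(k):                  # one rank-1 update per index of the shared dimension
        for i in range(m):
            a_ip = A[i][p]              # scalar reused across the whole row of the update
            for j in range(n):
                C[i][j] += a_ip * B[p][j]   # element-wise accumulate, no reduction
    return C


def gemm_inner_product(A, B):
    """Compute C = A @ B with dot products, for comparison.

    Each output element needs a sum over k products, i.e. a reduction.
    """
    m, k, n = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(n)]
            for i in range(m)]


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C_outer = gemm_outer_product(A, B)
C_inner = gemm_inner_product(A, B)
print(C_outer)  # [[19.0, 22.0], [43.0, 50.0]]
```

Both orderings give the same result; the difference is that the outer-product form only ever performs element-wise multiply-accumulates into the output, which is why it suits hardware without a native reduction operation.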

Key Contribution

HBM-PIM reaches up to 14.9 GFLOP/s on a single HBM pseudo-channel for AME matrix tile multiplication by using a reduction-free outer-product dataflow, despite lacking native reduction support.

Abstract

High Bandwidth Memory with Processing-in-Memory (HBM-PIM) offers an opportunity to reduce data movement by executing computation directly inside memory, but current commercial platforms expose limited instruction sets and require specialized software stacks. In this work, we investigate whether HBM-PIM can serve as a backend for ISA-level matrix acceleration, using the RISC-V Attached Matrix Extension (AME) as a semantic reference. We propose a PEP-based execution model that maps AME element-wise and matrix instructions to HBM-PIM micro-kernels, and data instructions to memory operations. Unlike state-of-the-art (SoA) HBM-PIM approaches, we introduce a reduction-free outer-product dataflow that enables accumulation entirely within memory despite the lack of native reduction support. Our approach supports end-to-end execution of element-wise operations, GEMV, and GEMM in PIM mode, minimizing host involvement and off-chip transfers. An experimental evaluation on Samsung Aquabolt-XL shows that AME matrix tile multiplication achieves up to 14.9 GFLOP/s (59.4 FLOP/cycle) on a single HBM pseudo-channel.
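The two throughput figures in the abstract can be related by simple arithmetic. The sketch below assumes that both numbers describe the same measurement and that "cycle" refers to a single fixed clock; the abstract does not state the clock frequency, so the value computed here is only what the two figures imply, not a reported specification.

```python
# Figures quoted in the abstract.
gflops = 14.9             # GFLOP/s on a single HBM pseudo-channel
flop_per_cycle = 59.4     # FLOP/cycle for the same kernel

# Assumption: both figures refer to the same run and the same clock.
implied_cycles_per_second = gflops * 1e9 / flop_per_cycle
implied_mhz = implied_cycles_per_second / 1e6

print(f"Implied cycle rate: {implied_mhz:.1f} MHz")  # Implied cycle rate: 250.8 MHz
```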

Architecture Design (Transformers, SSMs, MoE) | Distributed Systems & Hardware | Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
