Apr 28, 2026

arXiv:2604.25317

FusionCIM: Accelerating LLM Inference with Fusion-Driven Computing-in-Memory Architecture

Zihao Xuan, Jia Chen, Yewen Li, Wei Xuan, Hegan Chen, Xiao Huo, Fengbin Tu

AI Summary

This paper introduces FusionCIM, a compute-in-memory (CIM) accelerator architecture that uses operator fusion to make large language model (LLM) inference more efficient. It combines a hybrid CIM pipeline for fused matrix multiplication, a QO-stationary dataflow that improves on-chip data reuse, and a pattern-aware online-softmax mechanism that reduces exponential rescaling overhead. On the LLaMA-3 model, FusionCIM achieves up to 3.86x energy savings and 1.98x speedup over prior state-of-the-art CIM-based designs, with a system-level energy efficiency of 29.4 TOPS/W.

Key Contribution

FusionCIM fuses the attention operators of LLM inference in a CIM accelerator, achieving up to 3.86x energy savings and 1.98x speedup over prior state-of-the-art CIM-based designs on LLaMA-3.

Abstract

In this paper, we propose FusionCIM, an operator-fusion-driven compute-in-memory (CIM) accelerator architecture for efficient and scalable LLM inference, with three key innovations: (1) a hybrid CIM pipeline architecture that maps QK^T computation onto inner-product-based CIM (IP-CIM) and PV aggregation onto outer-product-based CIM (OP-CIM) for efficient matrix-multiplication fusion; (2) a QO-stationary dataflow that eliminates repeated KV loading in CIM and K-matrix access in the buffer under transpose fusion, significantly improving on-chip data reuse; and (3) a pattern-aware online-softmax mechanism that exploits distribution regularities of attention scores to reduce exponential rescaling overhead for non-linear fusion. Experimental results on the LLaMA-3 model show that FusionCIM achieves up to 3.86x energy savings and 1.98x speedup compared with prior SOTA CIM-based designs, with 29.4 TOPS/W energy efficiency at the system level.
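To make the "exponential rescaling overhead" in point (3) concrete, the following is a minimal software sketch of standard online softmax attention for a single query, where scores come from inner products with K and the output is accumulated one value row at a time. It is not the paper's pattern-aware mechanism or its hardware mapping; the function names `attention_reference` and `attention_online` and the rescale counter are assumptions introduced here for illustration only.

```python
import math


def attention_reference(q, K, V):
    """Two-pass softmax attention for one query vector (baseline)."""
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in K]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    d_v = len(V[0])
    return [sum(w * v[c] for w, v in zip(weights, V)) / total
            for c in range(d_v)]


def attention_online(q, K, V):
    """Single-pass attention with standard online softmax.

    Returns (output, rescales), where rescales counts how often the
    running maximum changed and the partial results had to be rescaled.
    """
    d_v = len(V[0])
    running_max = -math.inf
    denom = 0.0            # running softmax denominator
    acc = [0.0] * d_v      # running, unnormalized output row
    rescales = 0
    for k, v in zip(K, V):
        # Score for this key: inner product of q and k.
        s = sum(qi * ki for qi, ki in zip(q, k))
        if s > running_max:
            if denom > 0.0:
                # A new maximum invalidates earlier exponentials, so the
                # partial denominator and output are rescaled.
                scale = math.exp(running_max - s)
                denom *= scale
                acc = [a * scale for a in acc]
                rescales += 1
            running_max = s
        p = math.exp(s - running_max)
        denom += p
        # Accumulate this key's contribution: p times its value row.
        acc = [a + p * vc for a, vc in zip(acc, v)]
    return [a / denom for a in acc], rescales
```

In this sketch the number of rescales depends only on the order in which scores arrive: scores in strictly increasing order trigger a rescale at every key after the first, while scores in decreasing order trigger none. That order dependence is the kind of regularity a pattern-aware scheme could plausibly exploit, though how FusionCIM does so is not specified in the abstract.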

Architecture Design (Transformers, SSMs, MoE)

Distributed Systems & Hardware

Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 22

Year: 2026

Venue: N/A
