This paper introduces FusionCIM, a compute-in-memory (CIM) architecture that uses operator fusion to make large language model (LLM) inference more efficient. It combines a hybrid CIM pipeline for matrix multiplication with a QO-stationary dataflow that improves data reuse, reducing both energy consumption and latency. Experimental results show that FusionCIM achieves up to 3.86x energy savings and 1.98x speedup over prior state-of-the-art CIM designs, with a system-level energy efficiency of 29.4 TOPS/W.
FusionCIM cuts LLM inference energy by up to 3.86x and delivers up to 1.98x speedup relative to prior state-of-the-art CIM designs.
In this paper, we propose FusionCIM, an operator-fusion-driven compute-in-memory (CIM) accelerator architecture for efficient and scalable LLM inference, with three key innovations: (1) a hybrid CIM pipeline architecture that maps QKᵀ computation onto inner-product-based CIM (IP-CIM) and PV aggregation onto outer-product-based CIM (OP-CIM) for efficient matrix-multiplication fusion; (2) a QO-stationary dataflow that eliminates repeated KV loading in CIM and K-matrix access in the buffer under transpose fusion, significantly improving on-chip data reuse; and (3) a pattern-aware online-softmax mechanism that exploits distribution regularities of attention scores to reduce exponential rescaling overhead for non-linear fusion. Experimental results on the LLaMA-3 model show that FusionCIM achieves up to 3.86x energy savings and 1.98x speedup compared with prior SOTA CIM-based designs, with 29.4 TOPS/W energy efficiency at the system level.
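To make the "exponential rescaling overhead" in innovation (3) concrete, here is a minimal sketch of the standard online-softmax recurrence for a single query row, processed in blocks of keys and values. It is not FusionCIM's pattern-aware mechanism, which the abstract does not specify. The function names, the `block` parameter, and the `rescales` counter are assumptions introduced for illustration only. The counter shows that a rescale is needed only when a new block raises the running maximum, so the order and distribution of attention scores determine how often it happens.

```python
import math


def online_softmax_attention(q, keys, values, block=2):
    """Compute softmax(q . K^T) . V for one query row, streaming K/V in blocks.

    Returns the output vector and the number of rescaling events
    (a hypothetical counter added here purely for illustration).
    """
    m = -math.inf                  # running maximum of the scores seen so far
    l = 0.0                        # running sum of exp(score - m)
    acc = [0.0] * len(values[0])   # running un-normalized output
    rescales = 0

    for start in range(0, len(keys), block):
        k_blk = keys[start:start + block]
        v_blk = values[start:start + block]
        scores = [sum(a * b for a, b in zip(q, k)) for k in k_blk]

        m_new = max(m, max(scores))
        if m_new > m and l > 0.0:
            # The running maximum grew: previously accumulated terms must be
            # rescaled by exp(m - m_new). This is the overhead in question.
            scale = math.exp(m - m_new)
            l *= scale
            acc = [a * scale for a in acc]
            rescales += 1
        m = m_new

        for s, v in zip(scores, v_blk):
            p = math.exp(s - m)
            l += p
            acc = [a + p * x for a, x in zip(acc, v)]

    return [a / l for a in acc], rescales


def reference_attention(q, keys, values):
    """Two-pass softmax attention, used only to check the streaming version."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    dim = len(values[0])
    return [sum(wi * v[j] for wi, v in zip(w, values)) / z for j in range(dim)]
```

In this sketch, visiting the keys in order of increasing score triggers a rescale at each block after the first, while visiting them in decreasing order triggers none. That gap is the kind of score-distribution regularity a pattern-aware scheme could plausibly exploit, though how FusionCIM does so is not described in the text above.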