Apr 30, 2026 · arXiv:2604.27384

RCW-CIM: A Digital CIM-based LLM Accelerator with Read-Compute/Write

Yan-Cheng Guo, Tian-Sheuan Chang, Jian-Wei Su

AI Summary

This paper introduces a read-compute/write (RCW) architecture for digital computing-in-memory (DCIM) to accelerate large language models (LLMs) by minimizing weight update latency. The proposed RCW architecture incorporates a nonlinear operator fusion mechanism and a weight-stationary and output column stationary (WS-OCS) dataflow to reduce both external DRAM access and internal CIM weight updates. Experimental results on the Llama2-7B model demonstrate a 21.59% reduction in decoding computing latency, a 69.17% latency reduction through nonlinear operator fusion, and a 49.76% overall latency reduction during the prefill phase.

Key Contribution

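The WS-OCS dataflow cuts internal CIM weight updates by 87.6% and external DRAM access by 51.6% during a 1024-token prefill, while the RCW architecture and nonlinear operator fusion reduce decoding latency and dependency-induced latency.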

Abstract

Digital computing-in-memory (DCIM) has emerged as a promising solution for large language model (LLM) acceleration by minimizing data transfers between external DRAM and on-chip accelerators while maintaining high precision for superior accuracy. However, existing CIM architectures often overlook weight update latency, which becomes critical as LLM weights are far larger than the capacity of a single CIM macro. To address this issue, this paper proposes a read-compute/write (RCW) architecture that effectively minimizes weight update latency, along with a nonlinear operator fusion that further mitigates dependency-induced latency. The proposed RCW reduces decoding computing latency by 21.59% on the Llama2-7B model. In addition, the nonlinear operator fusion mechanism achieves a 69.17% latency reduction through efficient partial accumulation and group-based approximation. Furthermore, a weight-stationary and output column stationary (WS-OCS) dataflow is introduced to reduce both external DRAM access and internal CIM weight updates by 51.6% and 87.6%, respectively, during the prefill phase of 1024 tokens, leading to an overall 49.76% latency reduction. Fabricated using TSMC 22 nm CMOS technology and operating at 100 MHz, the proposed RCW-CIM achieves 3.28 TOPS and 42.3 TOPS/W, enabling 4.2 ms prefill latency and 26.87 decoded tokens per second for the INT4-weight Llama2 model with dual DDR5-6400 memory.
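
The decoding throughput and efficiency figures in the abstract can be sanity-checked with simple arithmetic. The sketch below is only a back-of-envelope estimate, not the authors' evaluation method. It assumes about 6.74e9 weights for Llama2-7B, 4 bits per weight with no scale or zero-point overhead, every weight streamed from DRAM once per decoded token, and "dual DDR5-6400" meaning two 64-bit channels. None of these assumptions is stated in the source, and all function and constant names are hypothetical.

```python
# Back-of-envelope consistency check for figures quoted in the abstract.
# ASSUMPTIONS (not stated in the source): Llama2-7B has about 6.74e9 weights,
# each stored in 4 bits with no scale/zero-point overhead; every weight is
# streamed from DRAM once per decoded token; "dual DDR5-6400" means two
# 64-bit channels at 6400 MT/s.

PARAMS = 6.74e9                # assumed parameter count
BYTES_PER_WEIGHT = 0.5         # INT4 weights
TRANSFERS_PER_S = 6400e6       # DDR5-6400 transfer rate
BYTES_PER_TRANSFER = 8         # assumed 64-bit channel width
CHANNELS = 2                   # assumed meaning of "dual"

REPORTED_TOKENS_PER_S = 26.87  # from the abstract
PEAK_TOPS = 3.28               # from the abstract
TOPS_PER_W = 42.3              # from the abstract
CLOCK_HZ = 100e6               # from the abstract


def dram_bandwidth() -> float:
    """Peak DRAM bandwidth in bytes per second under the assumptions above."""
    return TRANSFERS_PER_S * BYTES_PER_TRANSFER * CHANNELS


def decode_bound_tokens_per_s() -> float:
    """Upper bound on decode rate if each token needs one full weight read."""
    return dram_bandwidth() / (PARAMS * BYTES_PER_WEIGHT)


def ops_per_cycle() -> float:
    """Operations per clock cycle implied by the peak TOPS at 100 MHz."""
    return PEAK_TOPS * 1e12 / CLOCK_HZ


def implied_power_w() -> float:
    """Power implied if peak TOPS and peak TOPS/W share one operating point
    (an assumption; the two peaks may be measured under different conditions)."""
    return PEAK_TOPS / TOPS_PER_W


if __name__ == "__main__":
    bound = decode_bound_tokens_per_s()
    print(f"DRAM bandwidth:        {dram_bandwidth() / 1e9:.1f} GB/s")
    print(f"Bandwidth-bound rate:  {bound:.2f} tokens/s")
    print(f"Reported / bound:      {REPORTED_TOKENS_PER_S / bound:.1%}")
    print(f"Ops per cycle:         {ops_per_cycle():.0f}")
    print(f"Implied power:         {implied_power_w() * 1e3:.1f} mW")
```

Under these assumptions the bandwidth bound comes out near 30 tokens/s, so the reported 26.87 tokens/s would sit at roughly 88% of it. That would be consistent with decoding being limited mainly by weight streaming from DRAM rather than by compute, although the real figure depends on quantization metadata, KV-cache traffic, and the actual memory configuration.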

Architecture Design (Transformers, SSMs, MoE) · Distributed Systems & Hardware · Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 8

Year: 2026

Venue: N/A
