CAS · Georgia Tech · PolyU · Mar 3, 2026 · arXiv:2603.02737

Ouroboros: Wafer-Scale SRAM CIM with Token-Grained Pipelining for Large Language Model Inference

Yiqi Liu, Yudong Pan, Mengdi Wang, Shixin Zhao, Haonan Zhu, Yinhe Han, Yi Han, Lei Zhang, Lei Zhang

AI Summary

Ouroboros is a wafer-scale SRAM CIM architecture for LLM inference that performs all operations in situ to avoid off-chip data movement. It combines token-grained pipelining, distributed dynamic KV cache management, and communication-aware mapping to make the most of limited on-chip memory and to optimize core allocation. Experiments show average gains of 4.1x in throughput and 4.2x in energy efficiency over conventional architectures, with peak gains of 9.1x and 17x for the 13B model.

Key Contribution

Wafer-scale SRAM CIM can deliver up to 17x better energy efficiency for LLM inference by eliminating off-chip data movement and using token-grained pipelining.

Abstract

Conventional LLM inference architectures suffer from high energy consumption and latency due to frequent data movement across memory hierarchies. We propose Ouroboros, a wafer-scale SRAM-based Computing-in-Memory (CIM) architecture that executes all operations in situ, eliminating off-chip migration. To maximize its limited first-level capacity, we introduce three innovations:

Token-Grained Pipelining: replaces sequence-level pipelining to mitigate length variations, boosting utilization and reducing activation storage.

Distributed Dynamic KV Cache Management: decouples memory from compute to leverage fragmented SRAM for efficient KV storage.

Communication-Aware Mapping: optimizes core allocation for locality and fault tolerance across the wafer.

Experimental results show that Ouroboros achieves average gains of $4.1\times$ in throughput and $4.2\times$ in energy efficiency, peaking at $9.1\times$ and $17\times$ for the 13B model.

(*Because arXiv limits the Abstract field to 1,920 characters, the abstract shown here is shortened. For the full abstract, please download the article.)
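To give a rough sense of why pipelining at token granularity can help when sequence lengths vary, the following is a minimal toy model in Python. It is not the paper's model or simulator. It assumes a lockstep pipeline in which every slot lasts as long as the longest unit in flight, so shorter units are effectively padded. The function names `pipeline_utilization` and `compare`, the cost of one time unit per token per stage, and the example lengths are all illustrative assumptions.

```python
def pipeline_utilization(unit_times, num_stages):
    """Utilization of a toy lockstep pipeline.

    Assumption: every slot is as long as the longest unit, so shorter
    units leave their stage idle for the remainder of the slot.
    """
    slot = max(unit_times)
    # Classic fill/drain accounting: N units need N + S - 1 slots.
    total_slots = len(unit_times) + num_stages - 1
    busy = sum(unit_times) * num_stages          # useful stage-time
    capacity = slot * total_slots * num_stages   # available stage-time
    return busy / capacity


def compare(seq_lengths, num_stages):
    """Compare sequence-level and token-level units in the toy model."""
    # Sequence-level: one pipeline unit per sequence, cost = its length.
    seq_level = pipeline_utilization(seq_lengths, num_stages)
    # Token-level: one pipeline unit per token, every unit costs 1.
    tokens = [1] * sum(seq_lengths)
    tok_level = pipeline_utilization(tokens, num_stages)
    return {
        "sequence_level_utilization": seq_level,
        "token_level_utilization": tok_level,
        # Inter-stage buffer size, in tokens' worth of activations.
        "sequence_level_buffer_tokens": max(seq_lengths),
        "token_level_buffer_tokens": 1,
    }


if __name__ == "__main__":
    result = compare([10, 20, 30, 40], num_stages=4)
    for name, value in result.items():
        print(f"{name}: {value}")
```

With the hypothetical lengths 10, 20, 30, and 40 and four stages, the toy model gives a utilization of about 0.36 for sequence-level units and about 0.97 for token-level units, and the inter-stage buffer shrinks from 40 tokens to 1. The sketch ignores autoregressive dependencies between tokens, attention cost that grows with context, and communication, all of which a real design has to handle.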

Architecture Design (Transformers, SSMs, MoE) · Distributed Systems & Hardware · Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 72

Year: 2026

Venue: N/A
