Ouroboros is a wafer-scale SRAM CIM architecture designed for efficient LLM inference: it performs all operations in situ to avoid off-chip data movement. The architecture combines token-grained pipelining, distributed dynamic KV cache management, and communication-aware mapping to make the most of limited on-chip memory and to optimize core allocation. Experimental results show that Ouroboros improves throughput (4.1x on average, up to 9.1x) and energy efficiency (4.2x on average, up to 17x) over conventional architectures, with the peak gains observed on the 13B model.
Wafer-scale SRAM CIM can deliver up to 17x better energy efficiency for LLM inference by eliminating off-chip data movement and using token-grained pipelining.
Conventional LLM inference architectures suffer from high energy consumption and latency due to frequent data movement across memory hierarchies. We propose Ouroboros, a wafer-scale SRAM-based Computing-in-Memory (CIM) architecture that executes all operations in situ, eliminating off-chip data migration. To make the most of its limited first-level memory capacity, we introduce three innovations: (1) Token-Grained Pipelining, which replaces sequence-level pipelining to mitigate the effect of sequence-length variation, boosting utilization and reducing activation storage; (2) Distributed Dynamic KV Cache Management, which decouples memory from compute so that fragmented SRAM can be used for efficient KV storage; and (3) Communication-Aware Mapping, which optimizes core allocation for locality and fault tolerance across the wafer. Experimental results show that Ouroboros achieves average gains of $4.1\times$ in throughput and $4.2\times$ in energy efficiency, peaking at $9.1\times$ and $17\times$ for the 13B model. (*Because arXiv limits the Abstract field to 1,920 characters, the abstract shown here is shortened. For the full abstract, please download the article.)
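The abstract's claim that token-grained pipelining mitigates length variation can be illustrated with a toy model. The sketch below is not the paper's simulator or scheduling algorithm. It assumes a lockstep pipeline in which every stage advances together and one token costs one time unit per stage. All function names are hypothetical.

```python
def sequence_level_time(lengths, stages):
    """Toy lockstep pipeline whose unit of work is a whole sequence.

    At each step, stage s holds sequence (step - s). All stages wait for
    the longest sequence currently in flight, so short sequences idle.
    """
    n = len(lengths)
    total = 0
    for step in range(n + stages - 1):
        in_flight = [lengths[step - s] for s in range(stages) if 0 <= step - s < n]
        total += max(in_flight)
    return total


def token_level_time(lengths, stages):
    """Toy lockstep pipeline whose unit of work is a single token.

    Every step costs one time unit, so only the fill/drain of the
    pipeline (stages - 1 steps) is lost, regardless of length mix.
    """
    return sum(lengths) + stages - 1


def utilization(lengths, stages, total_time):
    """Fraction of stage-time slots that perform useful token work."""
    useful = sum(lengths) * stages
    return useful / (total_time * stages)


if __name__ == "__main__":
    lengths = [8, 2, 2]  # assumed example batch with uneven sequence lengths
    stages = 2           # assumed pipeline depth
    t_seq = sequence_level_time(lengths, stages)
    t_tok = token_level_time(lengths, stages)
    print("sequence-level:", t_seq, round(utilization(lengths, stages, t_seq), 3))
    print("token-level:   ", t_tok, round(utilization(lengths, stages, t_tok), 3))
```

In this toy setting the sequence-level schedule takes 20 time units and the token-level schedule takes 13, because the short sequences no longer wait on the long one. The same intuition suggests why activation storage shrinks: a stage only needs to buffer a token's activations rather than a whole sequence's. The actual buffering in Ouroboros is described in the full article, not here.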