The paper introduces SCIN, a switch-centric in-network architecture for LLM inference that overcomes limitations of accelerator-centric approaches like NVLS by enabling direct broadcast of reduced data and supporting in-network quantization. SCIN employs an in-switch accelerator (ISA) to initiate memory-semantic operations and integrates a quantization module for 8-bit All-Reduce, reducing bandwidth requirements. FPGA-based experiments demonstrate SCIN accelerates All-Reduce by up to 8.7x for small messages and achieves up to 1.74x faster TTFT and 1.34x faster TPOT on LLaMA-2 models.
Forget GPU-centric All-Reduce: SCIN's switch-centric architecture with in-network quantization accelerates All-Reduce by up to 8.7x for small messages and delivers up to 1.74x faster TTFT and 1.34x faster TPOT on LLaMA-2 models.
In-network computing techniques, exemplified by NVLink Sharp (NVLS), offer a promising approach to addressing the communication bottlenecks in LLM inference by offloading collective operations, such as All-Reduce, to switches. However, the accelerator-centric architecture of NVLS suffers from two fundamental limitations: 1) it relies on GPU load instructions to trigger reduction operations, which means that the data reduced in the switch must be additionally transferred back to the initiating GPU rather than being broadcast directly, thereby introducing unnecessary communication overhead; 2) due to its architectural constraints, NVLS cannot offload operators that are not decomposable into memory-semantic instructions, such as the in-network quantization (INQ) proposed in this work. As a result, All-Reduce in NVLS must operate at FP16/BF16 precision, leading to substantial bandwidth waste. To address these limitations, we propose SCIN, the first switch-centric in-network architecture for shared-memory networks of AI accelerators, enabling both low-latency and high-bandwidth All-Reduce. Specifically, we introduce an in-switch accelerator (ISA) capable of initiating memory-semantic operations for in-network processing, together with a co-designed communication fabric that incurs negligible protocol overhead. By eliminating redundant data movement, SCIN delivers lower All-Reduce latency than NVLS. Moreover, by integrating a quantization module into the ISA, SCIN enables INQ for All-Reduce, reducing its precision to 8 bits and nearly doubling bandwidth with negligible accuracy loss. We also present a prototype of SCIN on a multi-FPGA system to demonstrate its feasibility and effectiveness. Experimental results show that our design accelerates All-Reduce by up to 8.7x for small messages and 3.8x for large messages, leading to up to 1.74x faster TTFT and 1.34x faster TPOT on LLaMA-2 models.
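To make the 8-bit All-Reduce idea concrete, the following is a minimal software sketch of a quantized reduce-and-broadcast, not the SCIN hardware design. The abstract does not specify the quantization scheme, so everything here is an assumption for illustration: symmetric per-block int8 quantization, a hypothetical block size `BLOCK = 32`, a 16-bit scale per block, and the helper names `quantize`, `dequantize`, `switch_all_reduce`, and `bandwidth_gain`.

```python
from typing import List, Tuple

BLOCK = 32  # hypothetical block size: number of elements sharing one scale


def quantize(block: List[float]) -> Tuple[float, List[int]]:
    """Symmetric int8 quantization of one block; returns (scale, codes)."""
    peak = max(abs(x) for x in block)
    scale = peak / 127.0 if peak > 0 else 1.0
    return scale, [round(x / scale) for x in block]


def dequantize(scale: float, codes: List[int]) -> List[float]:
    """Map int8 codes back to floating-point values."""
    return [scale * c for c in codes]


def switch_all_reduce(per_device: List[List[float]]) -> List[float]:
    """Simulate devices sending int8 blocks to a switch that sums them
    and broadcasts one int8 result back to every device."""
    n = len(per_device[0])
    result: List[float] = []
    for start in range(0, n, BLOCK):
        # Uplink: each device quantizes its block before sending it.
        received = [
            dequantize(*quantize(dev[start:start + BLOCK]))
            for dev in per_device
        ]
        # In-switch reduction, carried out at higher precision.
        summed = [sum(vals) for vals in zip(*received)]
        # Downlink: the switch requantizes once and broadcasts the result.
        result.extend(dequantize(*quantize(summed)))
    return result


def bandwidth_gain(block: int = BLOCK, scale_bits: int = 16) -> float:
    """Payload ratio of 16-bit elements to 8-bit codes plus one scale per block."""
    return 16.0 / (8.0 + scale_bits / block)
```

Under these assumed parameters, each element costs 8 bits plus 16/32 bits of scale overhead, so `bandwidth_gain()` evaluates to about 1.88. That is consistent with the abstract's "nearly doubling" rather than an exact 2x, although the paper's actual overhead may differ.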