Mar 18, 2026

arXiv:2603.17803

Swarm: Co-Activation Aware KVCache Offloading Across Multiple SSDs

Tuowei Wang, Liyun Chu, Ruwen Fan, Ju Ren

AI Summary

This paper introduces Swarm, a system that offloads the LLM KVCache to multiple SSDs to relieve memory pressure and avoid the cost of keeping large, persistent caches in DRAM. Swarm exploits an observed phenomenon the authors call "KVCache Co-Activation": accessing one KV entry is strongly correlated with accessing a stable, recurring set of other KV entries. By clustering co-activated KV entries and distributing the clusters across multiple SSDs with a graph-based placement strategy, Swarm reduces I/O time by 2.41x and improves effective bandwidth utilization by 2.72x.
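The clustering idea in the summary can be illustrated with a small sketch. This is not the paper's algorithm: it assumes a hypothetical access trace (a list of decoding steps, each listing the KV entry ids touched) and groups entries with a simple threshold plus union-find. The names `coactivation_counts`, `cluster_entries`, and `min_count` are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def coactivation_counts(trace):
    """Count how often each pair of KV entry ids appears in the same step."""
    counts = Counter()
    for step in trace:
        for a, b in combinations(sorted(set(step)), 2):
            counts[(a, b)] += 1
    return counts


def cluster_entries(trace, min_count=2):
    """Group entries linked by pairs co-activated at least `min_count` times.

    Assumption: a plain threshold + union-find stands in for whatever
    clustering method the paper actually uses.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Register every entry so singletons form their own cluster.
    for step in trace:
        for entry in step:
            find(entry)

    # Merge entries whose co-activation count reaches the threshold.
    for (a, b), n in coactivation_counts(trace).items():
        if n >= min_count:
            parent[find(a)] = find(b)

    clusters = {}
    for entry in parent:
        clusters.setdefault(find(entry), set()).add(entry)
    return sorted((sorted(c) for c in clusters.values()), key=lambda c: c[0])


# Hypothetical trace: entries 0, 1, 2 recur together, as do 3 and 4.
trace = [[0, 1, 2], [0, 1, 2], [3, 4], [3, 4, 0], [5]]
clusters = cluster_entries(trace)
print(clusters)
```

In this toy trace, entry 0 appears once alongside 3 and 4, but a single co-occurrence falls below the threshold, so the two groups stay separate. A real system would presumably weigh correlation strength rather than use a hard cutoff.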

Key Contribution

Swarm replaces bandwidth-bound single-SSD paging with parallel I/O across multiple SSDs: by exploiting stable co-activation patterns in LLM KVCache accesses, it improves effective bandwidth utilization by 2.72x.

Abstract

The key-value (KV) cache has become the dominant contributor to memory consumption in large language model (LLM) inference. Although offloading KVCache from GPU high-bandwidth memory (HBM) to CPU DRAM alleviates device memory pressure, DRAM remains capacity-limited and costly for large, persistent workloads. Solid-state drives (SSDs) provide a cost-effective alternative, but naive SSD-based paging is fundamentally bandwidth-bound due to limited PCIe throughput and per-device bandwidth constraints. In this paper, we observe that KVCache activations in real-world workloads exhibit strong and stable correlations. We term this phenomenon KVCache Co-Activation, where accessing a KV entry is often accompanied by a stable and recurring set of other KV entries. Leveraging this property, we present Swarm, an SSD-based KVCache offloading system that converts bandwidth-bound single-device access into parallel I/O across multiple SSDs. Specifically, Swarm clusters co-activated KV entries offline and distributes the resulting clusters across SSDs using graph-based placement with selective replication to maximize parallel I/O bandwidth. At runtime, Swarm performs load-balanced cluster retrieval and dynamically adapts clustering and caching decisions to sustain high bandwidth utilization under evolving access patterns. Evaluations show that Swarm reduces I/O time by 2.41x and improves effective bandwidth utilization by 2.72x.
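To make the "parallel I/O across multiple SSDs" step concrete, here is a minimal sketch of one possible placement policy. It is a greedy striping heuristic, not the graph-based placement with selective replication described in the abstract, and it ignores replication entirely. The functions `place_clusters` and `parallel_reads`, and the example clusters, are hypothetical.

```python
def place_clusters(clusters, num_ssds):
    """Assign each KV entry to an SSD, spreading every cluster across devices.

    Assumption: greedy striping. Each entry goes to the SSD holding the
    fewest entries of the same cluster, with ties broken by total load
    and then by device index.
    """
    load = [0] * num_ssds  # total entries placed on each SSD
    placement = {}
    for cluster in clusters:
        used = [0] * num_ssds  # entries of this cluster on each SSD
        for entry in cluster:
            ssd = min(range(num_ssds), key=lambda d: (used[d], load[d], d))
            placement[entry] = ssd
            used[ssd] += 1
            load[ssd] += 1
    return placement


def parallel_reads(cluster, placement, num_ssds):
    """Sequential reads on the busiest SSD when fetching a whole cluster."""
    per_ssd = [0] * num_ssds
    for entry in cluster:
        per_ssd[placement[entry]] += 1
    return max(per_ssd)


# Hypothetical clusters of co-activated KV entry ids, placed on 3 SSDs.
clusters = [[0, 1, 2], [3, 4], [5]]
placement = place_clusters(clusters, num_ssds=3)
print(placement)
print(parallel_reads([0, 1, 2], placement, 3))
```

If all three entries of the first cluster lived on one device, fetching it would take three sequential reads. Striped across three SSDs, the busiest device serves one read, which is the intuition behind turning single-device access into parallel I/O. Actual speedups depend on device bandwidth, request sizes, and queueing, none of which this sketch models.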

Architecture Design (Transformers, SSMs, MoE)

Distributed Systems & Hardware

Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 50

Year: 2026

Venue: N/A
