Apr 28, 2026arXiv:2604.25080

CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration

Sean Nian, Jiahao Fang, Qilong Feng, Zhiyu Wu, Fan Lai

AI Summary

CacheFlow tackles the KV cache restoration bottleneck in long-context LLM serving by introducing a 3D parallelism abstraction across tokens, layers, and GPUs. It employs a batch-aware two-pointer scheduler to optimize compute and I/O allocation, prioritizing operations that minimize recomputation. Experiments demonstrate a 10%-62% reduction in Time-To-First-Token (TTFT) compared to existing methods across various models, workloads, and hardware setups.

Key Contribution

CacheFlow slashes LLM serving latency by up to 62% by rethinking KV cache restoration as a 3D-parallel scheduling problem, not just a recompute vs. I/O tradeoff.

Abstract

KV cache restoration has emerged as a dominant bottleneck in serving long-context LLM workloads, including multi-turn conversations, retrieval-augmented generation, and agentic pipelines. Existing approaches treat restoration as a per-request tradeoff between recomputation and I/O transfer, recomputing KV states from scratch or offloading them from external storage (e.g., CPU memory or remote machines). However, existing advances fail to exploit parallelism across tokens, layers, and distributed deployments, and critically ignore resource contention under batched serving. We present CacheFlow, a KV cache restoration framework that rethinks cache restoration as a multi-dimensional parallel execution problem. CacheFlow introduces a unified 3D parallelism abstraction across tokens, layers, and GPUs, enabling fine-grained overlap of recomputation and I/O along the structural dependencies of transformer inference. At the core of CacheFlow is a batch-aware two-pointer scheduler that jointly optimizes compute and I/O allocation across requests by prioritizing operations with the highest marginal reduction in recomputation cost. Our evaluations show that CacheFlow reduces Time-To-First-Token (TTFT) by 10%-62% over existing advances across diverse models, workloads, and hardware.

Architecture Design (Transformers, SSMs, MoE)Distributed Systems & Hardware Inference & Quantization

Citation Metrics

Citations0

Influential citations0

References25

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

CacheFlow: Efficient LLM Serving with 3D-Parallel KV Cache Restoration

Related Papers