The paper identifies KV-Cache storage I/O as the primary bottleneck in multi-turn, agentic LLM inference within disaggregated architectures, caused by bandwidth saturation of storage NICs on prefill engines. To address this, the authors introduce DualPath, an inference system that adds a novel storage-to-decode path for KV-Cache loading, followed by RDMA transfer to prefill engines. Results on production agentic workloads demonstrate throughput improvements of up to 1.87x for offline inference and an average of 1.96x for online serving without SLO violations.
Double your LLM inference throughput by routing KV-cache through decoding engines to bypass the bandwidth bottleneck on prefill engines.
The performance of multi-turn, agentic LLM inference is increasingly dominated by KV-Cache storage I/O rather than computation. In prevalent disaggregated architectures, loading the massive KV-Cache from external storage creates a fundamental imbalance: storage NICs on prefill engines become bandwidth-saturated, while those on decoding engines remain idle. This asymmetry severely constrains overall system throughput. We present DualPath, an inference system that breaks this bottleneck by introducing dual-path KV-Cache loading. Beyond the traditional storage-to-prefill path, DualPath enables a novel storage-to-decode path, in which the KV-Cache is loaded into decoding engines and then efficiently transferred to prefill engines via RDMA over the compute network. DualPath combines this optimized data path -- which inherently avoids both network congestion and interference with latency-critical model execution communications -- with a global scheduler that dynamically balances load across prefill and decode engines. Our evaluation on three models with production agentic workloads demonstrates that DualPath improves offline inference throughput by up to 1.87$\times$ on our in-house inference system. It also improves online serving throughput by an average factor of 1.96$\times$ without SLO violations.
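The dual-path routing decision described above can be sketched as follows. This is a minimal, hypothetical illustration assuming per-engine storage-NIC bandwidth accounting; all names (`Engine`, `choose_kv_load_path`, the bandwidth figures) are illustrative and not from the paper, which does not publish its scheduler interface.

```python
# Hypothetical sketch of DualPath-style path selection: route a KV-Cache
# load to whichever engine's storage NIC has spare bandwidth. Loading via
# the decode engine implies a follow-up RDMA transfer to the prefill
# engine over the compute network, as described in the abstract.
from dataclasses import dataclass


@dataclass
class Engine:
    name: str
    storage_nic_gbps: float   # storage NIC capacity (illustrative units)
    storage_load_gbps: float  # current storage traffic on that NIC

    def spare_bandwidth(self) -> float:
        return max(self.storage_nic_gbps - self.storage_load_gbps, 0.0)


def choose_kv_load_path(prefill: Engine, decode: Engine, kv_gbps: float) -> str:
    """Pick a loading path for a KV-Cache request needing `kv_gbps`."""
    if prefill.spare_bandwidth() >= kv_gbps:
        return "storage-to-prefill"       # classic path: prefill NIC has headroom
    if decode.spare_bandwidth() >= kv_gbps:
        return "storage-to-decode+rdma"   # novel path: use the idle decode NIC
    # Neither NIC has full headroom: fall back to the less-loaded one.
    return ("storage-to-prefill"
            if prefill.spare_bandwidth() >= decode.spare_bandwidth()
            else "storage-to-decode+rdma")


# Example: prefill storage NIC near saturation, decode NIC mostly idle,
# so the scheduler routes the load through the decode engine.
prefill = Engine("prefill-0", storage_nic_gbps=100.0, storage_load_gbps=98.0)
decode = Engine("decode-0", storage_nic_gbps=100.0, storage_load_gbps=5.0)
print(choose_kv_load_path(prefill, decode, kv_gbps=20.0))  # storage-to-decode+rdma
```

A production scheduler would of course track many engines and account for RDMA transfer cost on the compute network, but the core asymmetry the paper exploits -- saturated prefill NICs, idle decode NICs -- is captured by this headroom comparison.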