PKU, SJTU · Feb 16, 2026 · arXiv:2602.14516

Efficient Multi-round LLM Inference over Disaggregated Serving

Wenhao He, Youhe Jiang, Penghao Zhao, Quanqing Xu, Eiko Yoneki, Bin Cui, Fangcheng Fu

AI Summary

This paper addresses the challenge of serving multi-round LLM inference workflows under prefill-decode disaggregation, where interleaved prefill and decode phases lead to suboptimal resource utilization. The authors introduce AMPD, a disaggregated serving framework that coordinates prefill workloads according to real-time demand, adaptively deciding where each prefill runs and how it is scheduled. The framework also includes a planning algorithm that derives the resource allocation and parallel strategies for the prefill and decode phases, which improves SLO attainment.

Key Contribution

AMPD is a new disaggregated serving framework that improves SLO attainment for multi-round LLM inference by adaptively managing interleaved prefill-decode workloads.

Abstract

With the rapid evolution of Large Language Models (LLMs), multi-round workflows, such as autonomous agents and iterative retrieval, have become increasingly prevalent. However, this raises hurdles for serving LLMs under prefill-decode (PD) disaggregation, a widely adopted paradigm that places the compute-bound prefill phase and the memory-bound decode phase on separate resources. Specifically, existing systems overlook the interleaved prefill-decode workload pattern in multi-round inference, leading to sub-optimal handling of the incremental prefill workloads and sub-optimal model deployment for the two phases. In this work, we present AMPD, a new disaggregated serving framework for multi-round LLM inference. The core of AMPD is to coordinate the prefill workloads according to the real-time load by adaptively determining where to carry out these workloads and how they are scheduled, in order to maximize service level objective (SLO) attainment. In addition, we tailor a planning algorithm to our scenario, facilitating the deduction of optimal resource allocation and parallel strategies for the two phases. Empirical results demonstrate that AMPD substantially improves SLO attainment compared to state-of-the-art baselines.
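
To make the "where to carry out these workloads" decision concrete, the following is a minimal sketch of one possible placement rule for an incremental prefill in a later round. It is not AMPD's actual algorithm, which this page does not describe. Every name here (`InstanceLoad`, `estimated_wait_s`, `route_incremental_prefill`, `kv_transfer_s`) and the FIFO cost model are assumptions made purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class InstanceLoad:
    """Hypothetical snapshot of one serving instance's current load."""
    queued_prefill_tokens: int    # tokens already waiting to be prefilled here
    prefill_tokens_per_s: float   # assumed prefill throughput of this instance


def estimated_wait_s(load: InstanceLoad, new_tokens: int) -> float:
    """Seconds until `new_tokens` finish prefilling, assuming FIFO processing."""
    return (load.queued_prefill_tokens + new_tokens) / load.prefill_tokens_per_s


def route_incremental_prefill(
    new_tokens: int,
    prefill_inst: InstanceLoad,
    decode_inst: InstanceLoad,
    kv_transfer_s: float,
) -> str:
    """Pick the instance with the lower estimated completion time.

    Assumption: running on the prefill instance costs an extra
    `kv_transfer_s` for moving KV cache between instances, while running
    on the decode instance (where the session's cache already lives)
    costs no transfer but uses that instance's slower prefill throughput.
    """
    remote = estimated_wait_s(prefill_inst, new_tokens) + kv_transfer_s
    local = estimated_wait_s(decode_inst, new_tokens)
    return "decode" if local <= remote else "prefill"
```

Under these assumptions, a short follow-up prompt arriving while the prefill instance is backlogged is routed to the decode instance, and a long one arriving while the prefill instance is idle is routed to the prefill instance. A real system would also need to account for the interference that local prefills cause to ongoing decoding, which this sketch ignores.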

Distributed Systems & Hardware · Inference & Quantization · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
