The paper introduces Flying Serving, a vLLM-based system that dynamically switches between data parallelism (DP) and tensor parallelism (TP) during LLM serving to optimize throughput, latency, and context capacity under varying workloads. It achieves this by virtualizing model weights and KV cache state, enabling zero-copy TP shard views and preserving KV state across parallelism layouts. Across several LLMs, Flying Serving improves performance by up to 4.79x under high load and 3.47x under low load while also serving latency- and memory-driven requests.
Achieve up to 4.79x higher throughput in LLM serving by dynamically switching between data and tensor parallelism on the fly, without restarting workers.
Production LLM serving must simultaneously deliver high throughput, low latency, and sufficient context capacity under non-stationary traffic and mixed request requirements. Data parallelism (DP) maximizes throughput by running independent replicas, while tensor parallelism (TP) reduces per-request latency and pools memory for long-context inference. However, existing serving stacks typically commit to a static parallelism configuration at deployment; adapting to bursts, priorities, or long-context requests is often disruptive and slow. We present Flying Serving, a vLLM-based system that enables online DP-TP switching without restarting engine workers. Flying Serving makes reconfiguration practical by virtualizing the state that would otherwise force data movement: (i) a zero-copy Model Weights Manager that exposes TP shard views on demand, (ii) a KV Cache Adaptor that preserves request KV state across DP/TP layouts, (iii) an eagerly initialized Communicator Pool to amortize collective setup, and (iv) a deadlock-free scheduler that coordinates safe transitions under execution skew. Across three popular LLMs and realistic serving scenarios, Flying Serving improves performance by up to $4.79\times$ under high load and $3.47\times$ under low load while supporting latency- and memory-driven requests.
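To make the zero-copy weights idea concrete, here is a minimal sketch in PyTorch of how TP shard views can alias a single canonical copy of each parameter, so a DP-TP switch never moves weight data. `WeightsManager`, `shard_view`, and the row-wise (column-parallel) split are illustrative assumptions, not Flying Serving's actual API; the paper's Model Weights Manager presumably also handles other sharding layouts and device placement.

```python
import torch

class WeightsManager:
    """Holds one canonical copy of each parameter and hands out TP shard
    views on demand (a sketch of the zero-copy shard-view idea)."""

    def __init__(self, weights: dict[str, torch.Tensor]):
        self.weights = weights  # canonical, unsharded parameters

    def shard_view(self, name: str, tp_rank: int, tp_size: int) -> torch.Tensor:
        # Slicing along dim 0 (a column-parallel convention) returns a
        # view that aliases the parent tensor's storage: zero-copy.
        w = self.weights[name]
        assert w.shape[0] % tp_size == 0, "output dim must divide TP size evenly"
        rows = w.shape[0] // tp_size
        return w[tp_rank * rows : (tp_rank + 1) * rows]

# TP size 1 (pure DP) uses the full tensor; TP size 4 uses quarter views.
mgr = WeightsManager({"qkv_proj": torch.randn(4096, 4096)})
full = mgr.shard_view("qkv_proj", tp_rank=0, tp_size=1)
quarter = mgr.shard_view("qkv_proj", tp_rank=1, tp_size=4)

# Both views share the canonical storage: no weights were copied.
base = mgr.weights["qkv_proj"].untyped_storage().data_ptr()
assert full.untyped_storage().data_ptr() == base
assert quarter.untyped_storage().data_ptr() == base
```

Because each layout is just a different set of views over the same storage, reconfiguration reduces to repointing workers at new slices plus handling the KV cache and communicators, which is what the remaining components in the abstract address.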