The paper introduces Flying Serving, a vLLM-based system that dynamically switches between data parallelism (DP) and tensor parallelism (TP) during LLM serving to optimize throughput, latency, and context capacity under varying workloads. It achieves this by virtualizing model weights and KV cache state, enabling zero-copy TP shard views and preserving KV state across parallelism layouts. Across several LLMs, Flying Serving improves performance by up to 4.79x under high load and 3.47x under low load while also serving latency- and memory-driven requests.
Achieve up to 4.79x higher throughput in LLM serving by dynamically switching between data and tensor parallelism on the fly, without restarting workers.
Production LLM serving must simultaneously deliver high throughput, low latency, and sufficient context capacity under non-stationary traffic and mixed request requirements. Data parallelism (DP) maximizes throughput by running independent replicas, while tensor parallelism (TP) reduces per-request latency and pools memory for long-context inference. However, existing serving stacks typically commit to a static parallelism configuration at deployment; adapting to bursts, priorities, or long-context requests is often disruptive and slow. We present Flying Serving, a vLLM-based system that enables online DP-TP switching without restarting engine workers. Flying Serving makes reconfiguration practical by virtualizing the state that would otherwise force data movement: (i) a zero-copy Model Weights Manager that exposes TP shard views on demand, (ii) a KV Cache Adaptor that preserves request KV state across DP/TP layouts, (iii) an eagerly initialized Communicator Pool to amortize collective setup, and (iv) a deadlock-free scheduler that coordinates safe transitions under execution skew. Across three popular LLMs and realistic serving scenarios, Flying Serving improves performance by up to $4.79\times$ under high load and $3.47\times$ under low load while supporting latency- and memory-driven requests.
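To make the zero-copy weights idea concrete, here is a minimal sketch in PyTorch of how TP shard views can alias a single canonical copy of each parameter, so a DP-TP switch never moves weight data. `WeightsManager`, `shard_view`, and the row-wise (column-parallel) split are illustrative assumptions, not Flying Serving's actual API; the paper's Model Weights Manager presumably also handles other sharding layouts and device placement.

```python
import torch

class WeightsManager:
    """Holds one canonical copy of each parameter and hands out TP shard
    views on demand (a sketch of the zero-copy shard-view idea)."""

    def __init__(self, weights: dict[str, torch.Tensor]):
        self.weights = weights  # canonical, unsharded parameters

    def shard_view(self, name: str, tp_rank: int, tp_size: int) -> torch.Tensor:
        # Slicing along dim 0 (a column-parallel convention) returns a
        # view that aliases the parent tensor's storage: zero-copy.
        w = self.weights[name]
        assert w.shape[0] % tp_size == 0, "output dim must divide TP size evenly"
        rows = w.shape[0] // tp_size
        return w[tp_rank * rows : (tp_rank + 1) * rows]

# TP size 1 (pure DP) uses the full tensor; TP size 4 uses quarter views.
mgr = WeightsManager({"qkv_proj": torch.randn(4096, 4096)})
full = mgr.shard_view("qkv_proj", tp_rank=0, tp_size=1)
quarter = mgr.shard_view("qkv_proj", tp_rank=1, tp_size=4)

# Both views share the canonical storage: no weights were copied.
base = mgr.weights["qkv_proj"].untyped_storage().data_ptr()
assert full.untyped_storage().data_ptr() == base
assert quarter.untyped_storage().data_ptr() == base
```

Because each layout is just a different set of views over the same storage, reconfiguration reduces to repointing workers at new slices plus handling the KV cache and communicators, which is what the remaining components in the abstract address.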