The paper introduces inference-fleet-sim, a tool that combines M/G/c queueing theory with discrete-event simulation to optimize GPU fleet configuration for LLM inference. It determines the minimum-cost fleet that meets a P99 time-to-first-token (TTFT) SLO, considering token-length distributions, routing policies, and queueing dynamics. The tool incorporates a physics-informed GPU performance model for A10G, A100, and H100 GPUs across various deployment topologies, demonstrating its ability to identify optimal configurations in fleet-planning scenarios where simpler analyses fail.
Seemingly idle LLM inference fleets can be secretly broken, and this simulator helps you find out why before you buy.
Sizing a GPU fleet for LLM inference is harder than it looks. The obvious questions -- how many GPUs, which type, where to split a two-pool fleet -- have no closed-form answers. They depend on the full token-length distribution, the routing policy, and queueing dynamics that turn ugly under heavy-tailed workloads. Existing tools optimize per-engine configuration for a fixed GPU count; none addresses the upstream question of how many GPUs to buy and how to arrange them.

inference-fleet-sim fills that gap. It combines analytical M/G/c queueing with discrete-event simulation (DES) to find the minimum-cost fleet configuration that empirically meets a P99 TTFT SLO. It includes a physics-informed GPU performance model covering A10G, A100, and H100 across monolithic, two-pool-routed, and disaggregated topologies, all without requiring access to real hardware.

We run the tool on seven fleet-planning scenarios drawn from two public workload traces (LMSYS, Azure) and one synthetic agent-heavy trace. Each scenario surfaces a result that simple analysis gets wrong -- the right split threshold, the cheapest GPU type, whether an apparently idle fleet is actually broken -- and shows why joint simulation of queueing, routing, and hardware is necessary to find it.
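To see why heavy tails force simulation rather than mean-based sizing, here is a minimal sketch of the kind of discrete-event loop involved: a c-server FCFS queue with Poisson arrivals, from which we read off the empirical P99 queueing delay (the queueing component of TTFT). This is an illustration under simplified assumptions, not inference-fleet-sim's implementation; the function name and the lognormal service-time parameters are hypothetical.

```python
import heapq
import random

def simulate_p99_wait(num_servers, arrival_rate, service_sampler,
                      num_jobs=50_000, seed=0):
    """Simulate a c-server FCFS queue with Poisson arrivals and
    arbitrary service times; return the empirical P99 queueing delay."""
    rng = random.Random(seed)
    # Min-heap of the times at which each server next becomes free.
    free_at = [0.0] * num_servers
    heapq.heapify(free_at)
    t = 0.0
    waits = []
    for _ in range(num_jobs):
        t += rng.expovariate(arrival_rate)      # Poisson arrival process
        earliest = heapq.heappop(free_at)       # first server to free up
        start = max(t, earliest)                # wait if all servers busy
        waits.append(start - t)                 # time spent queued
        heapq.heappush(free_at, start + service_sampler(rng))
    waits.sort()
    return waits[int(0.99 * len(waits))]

# Two workloads with the same mean service time (~1s) at 75% utilization:
# a heavy-tailed lognormal vs. a deterministic one.
heavy = lambda rng: rng.lognormvariate(-0.5, 1.0)   # mean = exp(0) = 1
fixed = lambda rng: 1.0
p99_heavy = simulate_p99_wait(num_servers=8, arrival_rate=6.0,
                              service_sampler=heavy)
p99_fixed = simulate_p99_wait(num_servers=8, arrival_rate=6.0,
                              service_sampler=fixed)
```

Both fleets look identical in a mean-based capacity plan (same utilization, same average service time), yet the heavy-tailed workload produces a much worse P99 wait; this is the gap between "the fleet is 75% utilized" and "the fleet meets its TTFT SLO" that the tool is built to close.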