Charlie · McGill · UChicago · Mar 17, 2026 · arXiv:2603.16514

FleetOpt: Analytical Fleet Provisioning for LLM Inference with Compress-and-Route as Implementation Mechanism

Xunzhuo Liu, Yuhan Liu, Junchen Jiang, Bowei He, Xue Liu

AI Summary

FleetOpt is an analytical framework for provisioning LLM GPU fleets that minimizes cost while meeting a P99 TTFT latency SLO. It models each pool as an M/G/c queue and derives that the minimum-cost fleet is a two-pool architecture (short-context and long-context), with the boundary B* set by an equal marginal GPU cost condition across both pools. The main obstacle to reaching B* is the "cost cliff": requests just above the boundary consume 8x--42x more GPU capacity than requests just below it. Compress-and-Route (C&R) addresses this with gateway-layer extractive compression that trims borderline requests below B*. On three production traces, the combined framework reduces GPU cost by 6--82% compared with a homogeneous fleet.
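The gateway decision described in the summary can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `compress_and_route` and `RoutingDecision` are hypothetical, and `gamma` is assumed here to be the largest multiple of the boundary that the gateway is willing to compress down to the boundary (this page does not define gamma). The compression step itself is not modeled; only the resulting length is recorded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RoutingDecision:
    pool: str         # "short" or "long"
    tokens: int       # prompt length the engine will see
    compressed: bool  # True if the gateway trimmed the prompt


def compress_and_route(prompt_tokens: int, boundary: int, gamma: float) -> RoutingDecision:
    """Route a request to the short or long pool (hypothetical sketch).

    Assumption: prompts in (boundary, gamma * boundary] are "borderline"
    and get trimmed to exactly `boundary` tokens before routing.
    """
    if prompt_tokens <= boundary:
        # Fits the short pool as-is.
        return RoutingDecision("short", prompt_tokens, False)
    if prompt_tokens <= gamma * boundary:
        # Borderline: compress below the boundary, avoiding the cost cliff.
        return RoutingDecision("short", boundary, True)
    # Too long to compress safely: send to the long-context pool.
    return RoutingDecision("long", prompt_tokens, False)
```

With a boundary of 4096 tokens and `gamma = 1.5`, a 5000-token prompt would be trimmed and served by the short pool, while a 7000-token prompt would go to the long pool unchanged.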

Key Contribution

LLM GPU fleets can be analytically optimized into a two-pool architecture with gateway-layer compression, cutting GPU cost by 6--82% versus a homogeneous fleet while still meeting the P99 TTFT target.

Abstract

Modern LLM GPU fleets are provisioned for worst-case context lengths that the vast majority of requests never approach, wasting GPU capacity on idle KV-cache slots. We present FleetOpt, a framework that starts from first principles: given a workload's prompt-length CDF and a P99 TTFT target, derive the minimum-cost fleet analytically, then deploy it in practice. The analytical core models each pool as an M/G/c queue and derives that the minimum-cost fleet is a two-pool architecture -- a short-context pool and a long-context pool -- with an optimal boundary B* satisfying an equal marginal GPU cost condition across both pools. The fundamental barrier to achieving B* is the cost cliff: a hard routing step where requests just above B* consume 8x--42x more GPU capacity than requests just below it (depending on the context window ratio), creating a structural disincentive to lower the boundary. Compress-and-Route (C&R) is the implementation mechanism that resolves this barrier. Gateway-layer extractive compression trims borderline requests below B* before the engine ever sees them, converting the hard hardware boundary into a software parameter read from the workload CDF. The two components are unified in the FleetOpt offline planner: given a CDF and SLO, it returns the optimal (n_s*, n_l*, B*, gamma*) in under 1 ms. On three production traces, the combined framework reduces total GPU cost by 6--82% versus a homogeneous fleet, with C&R contributing 1--44 percentage points beyond plain pool routing depending on workload archetype. The analytical model is validated against a discrete-event simulator (inference-fleet-sim) with <= 3% error on predicted GPU utilization across all pools and workloads.
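To make the per-pool queueing step concrete, the sketch below sizes a single pool as an M/G/c queue. It is an illustration under stated assumptions, not the FleetOpt planner: each "server" is treated as one concurrent request slot, the probability of waiting comes from the standard Erlang C formula, and the P99 queueing delay uses the M/M/c exponential tail with its decay rate scaled by 2 / (1 + scv), an Allen-Cunneen-style heuristic for general service times. The function names and parameters (`lam`, `mean_service`, `scv`, `wait_budget`) are hypothetical.

```python
import math


def erlang_c(c: int, a: float) -> float:
    """Probability that an arrival must wait in an M/M/c queue.

    `a` is the offered load lambda / mu in Erlangs; requires a < c.
    Uses the numerically stable Erlang B recursion.
    """
    b = 1.0
    for k in range(1, c + 1):
        b = a * b / (k + a * b)
    rho = a / c
    return b / (1.0 - rho * (1.0 - b))


def p99_wait(c: int, lam: float, mean_service: float, scv: float) -> float:
    """Approximate 99th-percentile queueing delay with c servers.

    Heuristic: P(W > t) ~ C * exp(-rate * t), where the M/M/c decay rate
    (c * mu - lambda) is scaled by 2 / (1 + scv) for general service times.
    """
    a = lam * mean_service
    if a >= c:
        return math.inf  # unstable: the queue grows without bound
    p_wait = erlang_c(c, a)
    if p_wait <= 0.01:
        return 0.0  # fewer than 1% of requests wait at all
    rate = (c / mean_service - lam) * 2.0 / (1.0 + scv)
    return math.log(p_wait / 0.01) / rate


def min_servers(lam: float, mean_service: float, scv: float,
                wait_budget: float, max_c: int = 100_000) -> int:
    """Smallest c whose approximate P99 queueing delay fits the budget."""
    c = max(1, math.floor(lam * mean_service) + 1)  # first stable size
    while c <= max_c:
        if p99_wait(c, lam, mean_service, scv) <= wait_budget:
            return c
        c += 1
    raise ValueError("no feasible pool size up to max_c")
```

A planner in the spirit of the abstract would call something like `min_servers` once for the short pool and once for the long pool at each candidate boundary, using arrival rates and service-time statistics derived from the prompt-length CDF, and keep the boundary with the lowest total cost.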

Architecture Design (Transformers, SSMs, MoE) · Distributed Systems & Hardware · Inference & Quantization

Citation Metrics

Citations: 1

Influential citations: 0

References: 22

Year: 2026

Venue: N/A
