May 4, 2026 | arXiv:2605.02329

Taming Request Imbalance: SLO-Aware Scheduling for Disaggregated LLM Inference

AI Summary

The paper introduces Kairos, a scheduling system designed to improve SLO attainment and throughput in disaggregated LLM inference by addressing request imbalance caused by long-tail distributions. Kairos uses urgency-based priority scheduling on the prefill side to maximize time-to-first-token (TTFT) SLO attainment and slack-guided adaptive batching on the decode side to maximize throughput while adhering to time-per-output-token (TPOT) SLOs. Experiments show Kairos improves TTFT SLO attainment by up to 23.9%, TPOT SLO attainment by up to 27.1%, end-to-end SLO attainment by up to 33.8%, and decode throughput by up to 19.3% compared to state-of-the-art baselines.
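The prefill-side idea described above can be illustrated with a minimal sketch. It is not the paper's implementation: the linear prefill-time predictor, its coefficients, the `PrefillRequest` fields, and the fallback to the oldest request are all assumptions made for illustration. The sketch picks the request with the least remaining slack among those that can still meet their TTFT deadline.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PrefillRequest:
    req_id: str
    arrival: float       # arrival time in seconds
    prompt_tokens: int   # prompt length in tokens
    ttft_slo: float      # allowed seconds from arrival to first token


def predict_prefill_time(prompt_tokens: int,
                         per_token: float = 0.0002,
                         overhead: float = 0.01) -> float:
    # Assumed linear cost model; a real system would fit this from profiling.
    return overhead + per_token * prompt_tokens


def slack(req: PrefillRequest, now: float) -> float:
    # Time left before the TTFT deadline if prefill started right now.
    deadline = req.arrival + req.ttft_slo
    return deadline - now - predict_prefill_time(req.prompt_tokens)


def pick_next(queue: List[PrefillRequest], now: float) -> Optional[PrefillRequest]:
    if not queue:
        return None
    # Requests that can still meet their TTFT SLO if scheduled now.
    feasible = [r for r in queue if slack(r, now) >= 0]
    if feasible:
        # Most urgent first: the smallest non-negative slack.
        return min(feasible, key=lambda r: slack(r, now))
    # Assumed fallback: nothing can meet its SLO, so serve the oldest request.
    return min(queue, key=lambda r: r.arrival)


queue = [
    PrefillRequest("long", arrival=0.0, prompt_tokens=8000, ttft_slo=1.0),
    PrefillRequest("short", arrival=0.5, prompt_tokens=100, ttft_slo=0.2),
]
chosen = pick_next(queue, now=0.6)
print(chosen.req_id)
```

Under FCFS the long request would run first and the short request would miss its deadline while waiting behind it; the slack-based choice serves the short request, which is the only one that can still meet its SLO.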

Key Contribution

In disaggregated LLM serving, scheduling prefill requests by urgency and batching decode requests according to TPOT slack improves end-to-end SLO attainment by up to 33.8% over state-of-the-art baselines.

Abstract

In production environments, large language model (LLM) serving is required to meet stringent service-level objectives (SLOs) amid highly variable request patterns. In practice, request lengths follow a long-tail distribution, which gives rise to head-of-line blocking on the prefill side and underutilization caused by stragglers on the decode side in disaggregated serving architectures. Current systems, which adopt first-come-first-served (FCFS) scheduling for prefill and continuous batching for decode, lack the ability to adapt to this imbalance, resulting in compromised SLO attainment and reduced throughput. To address these challenges, we propose Kairos, an SLO-aware scheduling system equipped with two complementary mechanisms. On the prefill side, Kairos employs urgency-based priority scheduling: it predicts prefill completion times and dynamically selects requests to maximize the attainment of time-to-first-token (TTFT) SLOs. On the decode side, Kairos introduces slack-guided adaptive batching, which leverages the gap between per-step decode time and the time-per-output-token (TPOT) SLO to greedily pack short requests. This approach maximizes throughput while strictly adhering to SLO requirements. We implement Kairos and conduct evaluations using an online serving dataset and a state-of-the-art LLM. Experimental results demonstrate that, compared with state-of-the-art baselines, Kairos improves TTFT SLO attainment by up to 23.9%, TPOT SLO attainment by up to 27.1%, end-to-end SLO attainment by up to 33.8%, and decode throughput by up to 19.3%.
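The decode-side mechanism, greedily packing short requests into the gap between the per-step decode time and the TPOT SLO, can be sketched as follows. This is a hedged illustration rather than the Kairos implementation: the step-time model in `predict_step_time`, its coefficients, and the representation of each request by its context length are assumptions.

```python
from typing import List


def predict_step_time(context_lens: List[int],
                      base: float = 0.010,
                      per_seq: float = 0.0005,
                      per_token: float = 0.000002) -> float:
    # Assumed cost model: a fixed cost, a per-sequence cost, and a cost
    # proportional to the total context tokens attended to in one decode step.
    return base + per_seq * len(context_lens) + per_token * sum(context_lens)


def pack_batch(running: List[int], waiting: List[int], tpot_slo: float) -> List[int]:
    """Return the waiting requests (as context lengths) admitted this step."""
    batch = list(running)
    admitted = []
    # Shortest first: short requests consume the least slack per admission.
    for ctx in sorted(waiting):
        if predict_step_time(batch + [ctx]) <= tpot_slo:
            batch.append(ctx)
            admitted.append(ctx)
        else:
            # The model is monotonic in context length, so no longer
            # request can fit once this one does not.
            break
    return admitted


# One long-running request, three waiting requests, and a 30 ms TPOT SLO.
admitted = pack_batch(running=[4000], waiting=[8000, 200, 300], tpot_slo=0.030)
print(admitted)
```

With these assumed coefficients the two short requests fit within the slack, while admitting the 8000-token request would push the predicted step time past the SLO, so it stays in the waiting queue.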

Distributed Systems & Hardware | Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
