This paper addresses the problem of scheduling LLM inference requests by modeling the stochastic nature of output length generation. The authors observe that output length follows a heavy-tailed distribution best fitted by a log-t distribution, and they propose a Tail Inflated Expectation (TIE) metric to account for the risk of long outputs. Experiments demonstrate that TIE-based scheduling reduces per-token latency by 2.31x for online inference and improves throughput by 1.42x for offline data generation compared to existing methods.
Stop guessing how long LLM outputs will be – modeling the *distribution* of possible lengths slashes latency by 2x and boosts throughput by 40%.
To schedule LLM inference, the \textit{shortest job first} (SJF) principle is favorable because it prioritizes requests with short output lengths to avoid head-of-line (HOL) blocking. Existing methods usually predict a single output length for each request to facilitate scheduling. We argue that such a \textit{point estimate} does not match the \textit{stochastic} decoding process of LLM inference, where output length is \textit{uncertain} by nature and determined by when the end-of-sequence (EOS) token is sampled. Hence, the output length of each request should be fitted with a distribution rather than a single value. Through an in-depth analysis of empirical data and the stochastic decoding process, we observe that output length follows a heavy-tailed distribution and can be fitted with the log-t distribution. On this basis, we propose a simple metric called Tail Inflated Expectation (TIE) to replace the output length in SJF scheduling, which adjusts the expectation of a log-t distribution with its tail probabilities to account for the risk that a request generates long outputs. To evaluate our TIE scheduler, we compare it with three strong baselines, and the results show that TIE reduces the per-token latency by $2.31\times$ for online inference and improves throughput by $1.42\times$ for offline data generation.
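As a rough illustration of the idea, the sketch below scores each request by a tail-inflated expectation of a log-t length distribution and schedules shortest-score-first. The exact TIE formula is not given on this page, so the inflation rule here (a censored mean scaled up by the tail probability beyond a cutoff), the parameter names, and the `alpha`/`cutoff` values are all illustrative assumptions, not the paper's definition.

```python
import numpy as np

def tie_score(mu, sigma, nu, cutoff=1024.0, alpha=2.0, n=100_000, seed=0):
    """Hypothetical TIE-style score (NOT the paper's exact formula).

    Assumes log(length) ~ mu + sigma * t(nu), i.e. length is log-t
    distributed. Because the log-t is heavy-tailed, its raw expectation
    can diverge, so we use a Monte Carlo censored mean (lengths clipped
    at `cutoff` tokens) and inflate it by the probability mass beyond
    the cutoff -- the "risk" that the request generates a long output.
    """
    rng = rng_lengths = np.random.default_rng(seed)
    lengths = np.exp(mu + sigma * rng_lengths.standard_t(nu, size=n))
    censored_mean = np.minimum(lengths, cutoff).mean()  # finite despite heavy tail
    tail_prob = (lengths > cutoff).mean()               # P(output exceeds cutoff)
    return censored_mean * (1.0 + alpha * tail_prob)

# SJF-style scheduling: serve the request with the smallest TIE score first.
# (mu, sigma, nu) per request would come from a fitted log-t predictor.
requests = {"r1": (3.0, 0.5, 4.0), "r2": (5.0, 1.2, 3.0), "r3": (2.0, 0.3, 6.0)}
order = sorted(requests, key=lambda r: tie_score(*requests[r]))
print(order)
```

Compared with ranking by a point estimate such as the median length `exp(mu)`, this ordering also penalizes requests whose fitted distribution puts substantial mass in the tail, which is the property the abstract attributes to TIE.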