Tsinghua AI | USTB | Feb 18, 2026 | arXiv:2602.16603

FlowPrefill: Decoupling Preemption from Prefill Scheduling Granularity to Mitigate Head-of-Line Blocking in LLM Serving

Chia-chi Hsieh, Zan Zong, Xinyang Chen, Jianjiang Li, Jidong Zhai, Lijie Wen

AI Summary

The paper addresses head-of-line (HoL) blocking during the prefill phase of LLM serving, which leads to time-to-first-token (TTFT) SLO violations. The authors introduce FlowPrefill, a system that decouples preemption granularity from scheduling frequency by combining operator-level preemption with event-driven scheduling. Experiments on real-world production traces show that FlowPrefill improves maximum goodput by up to 5.6x over state-of-the-art systems while meeting heterogeneous SLOs.

Key Contribution

By decoupling preemption granularity from scheduling frequency, an LLM serving system can raise maximum goodput by up to 5.6x over state-of-the-art systems while still satisfying heterogeneous TTFT SLOs.

Abstract

The growing demand for large language models (LLMs) requires serving systems to handle many concurrent requests with diverse service level objectives (SLOs). This exacerbates head-of-line (HoL) blocking during the compute-intensive prefill phase, where long-running requests monopolize resources and delay higher-priority ones, leading to widespread time-to-first-token (TTFT) SLO violations. While chunked prefill enables interruptibility, it introduces an inherent trade-off between responsiveness and throughput: reducing chunk size improves response latency but degrades computational efficiency, whereas increasing chunk size maximizes throughput but exacerbates blocking. This necessitates an adaptive preemption mechanism. However, dynamically balancing execution granularity against scheduling overheads remains a key challenge. In this paper, we propose FlowPrefill, a TTFT-goodput-optimized serving system that resolves this conflict by decoupling preemption granularity from scheduling frequency. To achieve adaptive prefill scheduling, FlowPrefill introduces two key innovations: 1) Operator-Level Preemption, which leverages operator boundaries to enable fine-grained execution interruption without the efficiency loss associated with fixed small chunking; and 2) Event-Driven Scheduling, which triggers scheduling decisions only upon request arrival or completion events, thereby supporting efficient preemption responsiveness while minimizing control-plane overhead. Evaluation on real-world production traces shows that FlowPrefill improves maximum goodput by up to 5.6$\times$ compared to state-of-the-art systems while satisfying heterogeneous SLOs.
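The two mechanisms named in the abstract can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the FlowPrefill implementation: time is counted in abstract "operator ticks", each request is a fixed number of operators, and the scheduler picks the request with the earliest deadline. The `Request` class, the `simulate` function, and the earliest-deadline-first policy are all hypothetical choices made for illustration; the abstract does not specify them. What the sketch does show is that the scheduler runs only on arrival or completion events, while a running request can be swapped out at any operator boundary.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    # Hypothetical request model: only the deadline takes part in ordering.
    deadline: int                          # assumed TTFT deadline, in operator ticks
    name: str = field(compare=False)
    arrival: int = field(compare=False)    # tick at which the request arrives
    ops_left: int = field(compare=False)   # operators still to execute


def simulate(requests):
    """Return ({name: finish_tick}, number_of_scheduling_decisions)."""
    pending = sorted(requests, key=lambda r: r.arrival)
    ready = []        # min-heap by deadline (assumed policy, not from the paper)
    running = None
    finish = {}
    decisions = 0
    tick = 0
    while pending or ready or running is not None:
        event = False
        # Arrival events: move newly arrived requests into the ready heap.
        while pending and pending[0].arrival <= tick:
            heapq.heappush(ready, pending.pop(0))
            event = True
        # Completion event: the running request has no operators left.
        if running is not None and running.ops_left == 0:
            finish[running.name] = tick
            running = None
            event = True
        # Event-driven scheduling: decide only when an event fired.
        if event:
            decisions += 1
            if running is not None:
                # We are at an operator boundary, so preemption is safe here.
                heapq.heappush(ready, running)
            running = heapq.heappop(ready) if ready else None
        # Execute exactly one operator of the running request.
        if running is not None:
            running.ops_left -= 1
        tick += 1
    return finish, decisions


if __name__ == "__main__":
    reqs = [
        Request(deadline=100, name="long", arrival=0, ops_left=10),
        Request(deadline=8, name="short", arrival=3, ops_left=2),
    ]
    print(simulate(reqs))
```

In this toy run the short request arrives at tick 3, preempts the long one at the next operator boundary, and finishes at tick 5, inside its deadline of 8. The long request finishes at tick 12, so no work is lost, and the scheduler is invoked only 4 times across 12 operator ticks rather than once per tick.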

Architecture Design (Transformers, SSMs, MoE) | Distributed Systems & Hardware | Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 49

Year: 2026

Venue: N/A
