Airbnb, Guizhou University, Inner Mongolia University, SJTU
Jun 11, 2026
arXiv:2606.13501

GF-DiT: Scheduling Parallelism for Diffusion Transformer Serving

Xinwei Qiang, Yifan Hu, Shixuan Sun, Jing Yang, Han Zhao, Chen Chen, Yu Feng, Jingwen Leng, Minyi Guo

AI Summary

This paper introduces GF-DiT, a policy-programmable runtime for elastic serving of Diffusion Transformers (DiTs) that dynamically adapts the GPU parallelism of running requests to workload demands and service objectives. By treating GPU parallelism as a first-class schedulable resource, GF-DiT handles the heterogeneity across requests and execution stages that static parallelism serves poorly. Implemented in vLLM-Omni, GF-DiT achieves up to 6.01× higher throughput and up to 95% lower mean latency than fixed-pipeline execution with static parallelism.

Key Contribution

GF-DiT achieves up to 6.01× higher throughput and up to 95% lower mean latency by dynamically adapting GPU parallelism to workload demands.

Abstract

Diffusion Transformers (DiTs) have become the dominant architecture for image and video generation, creating growing demand for efficient DiT serving. Existing systems assign each request a fixed parallel configuration throughout its lifetime. However, DiT workloads exhibit substantial heterogeneity across requests, execution stages, and system conditions, making static parallelism inefficient and often leading to poor GPU utilization and degraded service quality. This paper argues that DiT serving should treat GPU parallelism as a first-class schedulable resource. We present GF-DiT, a policy-programmable runtime for elastic DiT serving that dynamically adapts the parallelism of running requests according to workload demands and service objectives. GF-DiT introduces an asynchronous execution abstraction that decomposes requests into independently schedulable trajectory tasks and enables online GPU reallocation. To make elastic parallelism practical, GF-DiT further proposes group-free collectives, a lightweight communication abstraction that supports low-overhead online formation and reconfiguration of arbitrary execution groups. We implement GF-DiT in vLLM-Omni and evaluate it on representative image and video diffusion workloads. Compared with fixed-pipeline execution with static parallelism, GF-DiT improves throughput by up to 6.01×, reduces mean latency by up to 95%, lowers SLO violation rates by up to 90%, and reduces communication-group setup overhead from 778 ms to approximately 60 µs.

Computer Vision; Distributed Systems & Hardware

Citation Metrics

Citations: 0
Influential citations: 0
References: 39
Year: 2026
Venue: N/A
