Beihang, HKUST, Independent Researcher, PKU, RUC
May 25, 2026
arXiv:2605.25550

DisagFusion: Asynchronous Pipeline Parallelism and Elastic Scheduling for Disaggregated Diffusion Serving

Hantian Zha, Teng Ma, Haiwen Fu, Ruiyang Ma, Wei Gao, Ruihao Gong, Xianglong Liu, Yunpeng Chai

AI Summary

DisagFusion addresses the challenge of serving large diffusion models by disaggregating the encoder, diffusion transformer (DiT), and decoder stages across heterogeneous GPUs. It introduces asynchronous pipeline parallelism to overlap computation and communication, and a hybrid instance scheduling strategy that combines performance prediction with runtime feedback to dynamically rebalance resources across stages. Experiments with modern diffusion models demonstrate a 3.4x-20.5x throughput improvement and 18.5x latency reduction compared to monolithic deployment.

Key Contribution

DisagFusion achieves up to 20.5x higher throughput for diffusion model serving by splitting the encoder, DiT, and decoder stages across heterogeneous GPUs and rebalancing instances across stages as the workload shifts.

Abstract

Diffusion-based generation increasingly powers production content pipelines, yet deploying these models at scale remains a significant challenge. Model weights frequently exceed the memory capacity of commodity GPUs, while the encoder, diffusion transformer (DiT), and decoder stages have highly imbalanced computational and memory footprints. A natural remedy is disaggregated serving, in which the stages run as separate services on heterogeneous GPUs. However, this introduces new bottlenecks, including stage handoff overheads and fast-changing workloads that make cross-stage provisioning and scheduling brittle. This paper presents DisagFusion, which enables asynchronous pipeline parallelism and elastic scheduling for disaggregated diffusion serving. First, DisagFusion introduces asynchronous pipeline parallelism that overlaps computation with stage-to-stage communication to reduce pipeline bubbles and mitigate network jitter. Second, DisagFusion employs a hybrid instance scheduling strategy that combines lightweight performance prediction with runtime feedback to continuously rebalance the instance ratio across stages under workload shifts. We implement DisagFusion and evaluate it with modern diffusion models. Compared to a monolithic baseline, DisagFusion improves throughput by 3.4x to 20.5x and reduces end-to-end latency by 18.5x, while enabling flexible, cost-efficient deployment across heterogeneous GPUs.
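To make the hybrid scheduling idea concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a fixed pool of interchangeable instances, a per-stage cost estimate in milliseconds from an offline predictor, and a measured per-stage cost from runtime feedback. The function name `allocate_instances`, the blending weight `alpha`, and all numbers are hypothetical and chosen only for illustration.

```python
def allocate_instances(predicted_ms, observed_ms, total_instances, alpha=0.5):
    """Split a fixed pool of instances across pipeline stages.

    predicted_ms: per-request stage cost from an offline predictor (assumed input).
    observed_ms:  per-request stage cost measured at runtime (assumed input).
    alpha:        weight on the prediction; (1 - alpha) goes to the feedback.
    """
    if total_instances < len(predicted_ms):
        raise ValueError("need at least one instance per stage")

    # Blend prediction and runtime feedback into one cost estimate per stage.
    # A stage with no measurement yet falls back to its prediction.
    cost = {
        stage: alpha * predicted_ms[stage]
        + (1 - alpha) * observed_ms.get(stage, predicted_ms[stage])
        for stage in predicted_ms
    }

    # Every stage needs at least one instance to keep the pipeline alive.
    alloc = {stage: 1 for stage in cost}

    # Greedily hand each remaining instance to the current bottleneck,
    # i.e. the stage with the highest cost per allocated instance.
    for _ in range(total_instances - len(cost)):
        bottleneck = max(cost, key=lambda s: cost[s] / alloc[s])
        alloc[bottleneck] += 1
    return alloc


# Hypothetical per-request costs in milliseconds.
predicted = {"encoder": 50, "dit": 900, "decoder": 150}

# Steady state: runtime measurements agree with the prediction.
steady = allocate_instances(predicted, dict(predicted), total_instances=8)

# Workload shift: decoding becomes much slower than predicted,
# so one instance moves from the DiT stage to the decoder stage.
shifted = allocate_instances(
    predicted, {"encoder": 50, "dit": 900, "decoder": 600}, total_instances=8
)
print(steady, shifted)
```

Re-running the allocation whenever fresh measurements arrive gives a simple form of continuous rebalancing. A real system would also have to account for the cost of migrating an instance between stages and for GPU heterogeneity, both of which this sketch ignores.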

Computer Vision, Distributed Systems & Hardware, Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
