Jun 17, 2026 · arXiv:2606.19163

Pulse: Training Acceleration for Large Diffusion Models with Automatic Pipeline Parallelism

Boran Sun, Guoyong Jiang, Lin Zhang, Chen Chen, Yuechen Tao, Zhishu Che, Jieling Yu, Shan Chang, Huaxi Gu, Fangming Liu, Bo Li

AI Summary

This paper introduces PULSE, an automatic pipeline-parallel training strategy for large diffusion models that targets the communication bottleneck caused by skip connections in UNet-style architectures. By collocating skip-connected encoder and decoder layers on the same device and caching skip activations locally, PULSE removes the skip traffic that would otherwise cross multiple pipeline boundaries under conventional pipeline parallelism. The authors report that communication volume can be reduced by 89% and training throughput increased by up to 2.3x on communication-bound hardware, compared with state-of-the-art parallelism strategies.
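The effect of collocation on skip traffic can be illustrated with a small counting exercise. The sketch below is not the paper's implementation: the function names, the equal-sized chunking, and the cost model (one "hop" per device boundary between a skip's source and destination) are assumptions made for illustration only.

```python
def sequential_placement(num_layers, num_devices):
    """Conventional pipeline: contiguous, equal-sized chunks of layers per device."""
    per_device = num_layers // num_devices
    return [min(i // per_device, num_devices - 1) for i in range(num_layers)]


def folded_placement(num_layers, num_devices):
    """Collocated pipeline: encoder layer i shares a device with its mirrored decoder layer."""
    half = num_layers // 2
    per_device = half // num_devices
    placement = [0] * num_layers
    for i in range(half):
        device = min(i // per_device, num_devices - 1)
        placement[i] = device                   # encoder layer i
        placement[num_layers - 1 - i] = device  # decoder layer that consumes its skip
    return placement


def skip_hops(placement):
    """Device boundaries crossed by all skips (assumed model: layer i feeds layer n-1-i)."""
    n = len(placement)
    return sum(abs(placement[i] - placement[n - 1 - i]) for i in range(n // 2))


if __name__ == "__main__":
    layers, devices = 16, 4
    print("sequential:", skip_hops(sequential_placement(layers, devices)))  # 16
    print("folded:    ", skip_hops(folded_placement(layers, devices)))      # 0
```

With 16 layers on 4 devices, the contiguous placement sends every skip across at least one boundary, while the folded placement keeps all of them local. The folded layout also means the forward pass runs down the devices and back up again, which is presumably why the abstract refers to wave schedules.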

Key Contribution

PULSE reduces communication volume by 89% and increases training throughput by up to 2.3x on communication-bound hardware, compared with state-of-the-art parallelism strategies, by keeping skip-connection traffic local to each device.

Abstract

Diffusion models are now a dominant approach for high-fidelity image and video generation, yet scaling their training across GPU clusters remains challenging. Unlike transformer-only architectures, diffusion backbones commonly adopt UNet-style encoder-decoder structures with heterogeneous layers and long-range skip connections. Under conventional pipeline parallelism, these non-local dependencies force large skip activations and their gradients to traverse multiple pipeline boundaries, making peer-to-peer (P2P) communication a dominant bottleneck and substantially reducing pipeline efficiency. In this paper, we present PULSE, an automatic pipeline-parallel training strategy that makes skip locality a first-class optimization objective. PULSE eliminates skip-induced communication by collocating skip-connected encoder-decoder layers on the same device and caching skip activations locally for later use in backpropagation. To realize this placement while maintaining high pipeline utilization, PULSE co-designs: (1) a skip-aware dynamic-programming partitioner that balances heterogeneous stage workloads under symmetric collocation constraints, (2) an ILP-based schedule synthesizer that generates bubble-efficient wave schedules for the resulting stage-to-device mapping, and (3) a hybrid parallelism tuner that selects pipeline/data-parallel degrees and microbatch sizes under memory and network constraints. Our extensive experiments show that the volume of communication can be reduced by 89 percent, and the training throughput can be increased by up to 2.3x on communication-bound hardware, compared with state-of-the-art parallelism strategies.
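To make the first component more concrete, the following is a minimal sketch of a dynamic-programming partitioner under a symmetric collocation constraint. It is a simplified stand-in, not the algorithm from the paper: it assumes each encoder layer has exactly one mirrored decoder layer, treats each encoder-decoder pair as an indivisible unit, and minimizes only the maximum per-device compute cost, ignoring memory and scheduling. The function name `partition_pairs` and the cost lists are hypothetical.

```python
from functools import lru_cache
from itertools import accumulate


def partition_pairs(enc_cost, dec_cost, num_devices):
    """Split skip-connected layer pairs into contiguous groups, one group per device.

    enc_cost[i] and dec_cost[i] are the (assumed) costs of encoder layer i and of
    the decoder layer that consumes its skip activation. Returns the bottleneck
    load and a list of half-open (start, end) pair ranges, one per device.
    """
    pair_cost = [e + d for e, d in zip(enc_cost, dec_cost)]
    n = len(pair_cost)
    if not 1 <= num_devices <= n:
        raise ValueError("need between 1 and len(pair_cost) devices")
    prefix = [0] + list(accumulate(pair_cost))  # prefix[j] = cost of pairs 0..j-1

    @lru_cache(maxsize=None)
    def best(start, k):
        """Minimal bottleneck for pairs start..n-1 on k devices, with the cut points."""
        if k == 1:
            return prefix[n] - prefix[start], (n,)
        result = None
        # The first group is start..cut-1; leave at least k-1 pairs for the rest.
        for cut in range(start + 1, n - k + 2):
            load = prefix[cut] - prefix[start]
            rest, cuts = best(cut, k - 1)
            candidate = max(load, rest)
            if result is None or candidate < result[0]:
                result = (candidate, (cut,) + cuts)
        return result

    bottleneck, cuts = best(0, num_devices)
    starts = (0,) + cuts[:-1]
    return bottleneck, list(zip(starts, cuts))


if __name__ == "__main__":
    # Heterogeneous costs: outer (high-resolution) layers are assumed more expensive.
    print(partition_pairs([4, 3, 2, 1], [5, 3, 2, 1], num_devices=2))
```

Because pairs are kept whole, every skip activation stays on the device that produced it. The price is less freedom in balancing: the partitioner can only cut between pairs, which is why heterogeneous layer costs make the balancing problem harder than in a uniform transformer stack.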

Computer Vision · Distributed Systems & Hardware · Training Efficiency & Optimization
