This paper introduces PULSE, an automatic pipeline-parallel training strategy for large diffusion models that targets the communication bottleneck created by skip connections in UNet-style architectures. By collocating skip-connected encoder and decoder layers on the same device and caching skip activations locally, PULSE removes the skip-induced inter-device traffic that limits conventional pipeline parallelism. In the reported experiments, PULSE cuts communication volume by 89% and raises training throughput by up to 2.3x on communication-bound hardware, compared with state-of-the-art parallelism strategies.
PULSE cuts communication volume by 89% and raises training throughput by up to 2.3x on communication-bound hardware when scaling diffusion model training across GPU clusters.
Diffusion models are now a dominant approach for high-fidelity image and video generation, yet scaling their training across GPU clusters remains challenging. Unlike transformer-only architectures, diffusion backbones commonly adopt UNet-style encoder-decoder structures with heterogeneous layers and long-range skip connections. Under conventional pipeline parallelism, these non-local dependencies force large skip activations and their gradients to traverse multiple pipeline boundaries, making peer-to-peer (P2P) communication a dominant bottleneck and substantially reducing pipeline efficiency. In this paper, we present PULSE, an automatic pipeline-parallel training strategy that makes skip locality a first-class optimization objective. PULSE eliminates skip-induced communication by collocating skip-connected encoder-decoder layers on the same device and caching skip activations locally for later use in backpropagation. To realize this placement while maintaining high pipeline utilization, PULSE co-designs: (1) a skip-aware dynamic-programming partitioner that balances heterogeneous stage workloads under symmetric collocation constraints, (2) an ILP-based schedule synthesizer that generates bubble-efficient wave schedules for the resulting stage-to-device mapping, and (3) a hybrid parallelism tuner that selects pipeline/data-parallel degrees and microbatch sizes under memory and network constraints. Extensive experiments show that PULSE reduces communication volume by 89% and increases training throughput by up to 2.3x on communication-bound hardware, compared with state-of-the-art parallelism strategies.
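To make the collocation idea concrete, here is a minimal Python sketch that compares a contiguous layer-to-device split with a symmetric (folded) one on a toy UNet. It is an illustration under stated assumptions, not the paper's partitioner. The toy model has equal numbers of encoder and decoder layers, encoder layer i is skip-connected to the mirrored decoder layer, all activations have the same size, and a skip activation crosses one pipeline boundary per device it passes. All function names are hypothetical.

```python
from typing import List, Tuple


def contiguous_placement(num_layers: int, num_devices: int) -> List[int]:
    """Conventional split: equal contiguous chunks of layers per device."""
    if num_layers % num_devices != 0:
        raise ValueError("toy model assumes layers divide evenly across devices")
    per_device = num_layers // num_devices
    return [i // per_device for i in range(num_layers)]


def symmetric_placement(num_layers: int, num_devices: int) -> List[int]:
    """Folded split: each encoder chunk shares a device with its mirrored decoder chunk."""
    half = num_layers // 2
    if num_layers % 2 != 0 or half % num_devices != 0:
        raise ValueError("toy model assumes an even, evenly divisible layer count")
    per_device = half // num_devices
    encoder = [i // per_device for i in range(half)]
    # The decoder half mirrors the encoder half, so skip pairs land on one device.
    return encoder + encoder[::-1]


def skip_pairs(num_layers: int) -> List[Tuple[int, int]]:
    """Assumed skip topology: encoder layer i feeds decoder layer (num_layers - 1 - i)."""
    return [(i, num_layers - 1 - i) for i in range(num_layers // 2)]


def skip_hops(placement: List[int], skips: List[Tuple[int, int]]) -> int:
    """Pipeline boundaries crossed by skip activations in one forward pass."""
    return sum(abs(placement[dst] - placement[src]) for src, dst in skips)


def sequential_crossings(placement: List[int]) -> int:
    """Device changes between consecutive layers (ordinary stage-to-stage traffic)."""
    return sum(1 for a, b in zip(placement, placement[1:]) if a != b)


# Toy configuration (assumed values, not taken from the paper).
LAYERS, DEVICES = 16, 4
skips = skip_pairs(LAYERS)
for name, placement in [
    ("contiguous", contiguous_placement(LAYERS, DEVICES)),
    ("symmetric", symmetric_placement(LAYERS, DEVICES)),
]:
    print(name, placement,
          "skip hops:", skip_hops(placement, skips),
          "sequential crossings:", sequential_crossings(placement))
```

In this toy setting the contiguous split incurs 16 skip hops per forward pass and the symmetric split incurs none, at the cost of 6 sequential stage crossings instead of 3. Each device now hosts two stages, which is consistent with the abstract's pairing of the placement with a dedicated wave-schedule synthesizer. The numbers only illustrate the mechanism and do not reproduce the reported 89% reduction.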