Apr 8, 2026 | arXiv:2604.06956

NestPipe: Large-Scale Recommendation Training on 1,500+ Accelerators via Nested Pipelining

Zhida Jiang, Zhaolong Xing, Huichao Chai, Tianxing Sun, Qiang Peng, Baopeng Yuan, Jiaxing Wang, Hua Du, Zhixin Wu, Xuemiao Li, Yikui Cao, Yongxiang Feng, Zhen Chen, Ke Zhang

AI Summary

NestPipe, a decentralized embedding training framework, addresses data movement bottlenecks in large-scale recommendation models by exploiting hierarchical sparse parallelism. It uses Dual-Buffer Pipelining (DBP) to mitigate lookup bottlenecks with staleness-free synchronization and Frozen-Window Pipelining (FWP) to overlap All2All communication with dense computation. Experiments on 1,536 workers show NestPipe achieves up to 3.06x speedup and 94.07% scaling efficiency while preserving synchronous training semantics.

Key Contribution

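Training trillion-parameter recommendation models at scale need not be bottlenecked by data movement: NestPipe achieves up to 3.06x speedup on 1,536 workers by pipelining embedding lookups across batches and overlapping All2All communication with dense computation, while preserving synchronous training semantics.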

Abstract

Modern recommendation models have grown to trillions of parameters. As clusters scale to O(1k) accelerators, distributed training bottlenecks shift from computation and memory to data movement, especially the lookup and communication latency associated with embeddings. Existing solutions either optimize only one bottleneck or improve throughput by sacrificing training consistency. This paper presents NestPipe, a large-scale decentralized embedding training framework that tackles both bottlenecks while preserving synchronous training semantics. NestPipe exploits two hierarchical sparse parallelism opportunities through nested pipelining. At the inter-batch level, Dual-Buffer Pipelining (DBP) constructs a staleness-free five-stage pipeline through dual-buffer synchronization, mitigating lookup bottlenecks without embedding staleness. At the intra-batch level, we identify the embedding freezing phenomenon, which inspires Frozen-Window Pipelining (FWP) to overlap All2All communication with dense computation via coordinated stream scheduling and key-centric sample clustering. Experiments on production GPU and NPU clusters with 1,536 workers demonstrate that NestPipe achieves up to 3.06x speedup and 94.07% scaling efficiency.
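
The inter-batch idea above, hiding lookup latency behind computation by alternating between two buffers, can be illustrated with a generic double-buffered prefetch loop. This is a minimal sketch under stated assumptions, not NestPipe's five-stage DBP pipeline: `lookup`, `dense_step`, and `train` are hypothetical stand-ins, the embedding table is read-only (so the staleness problem that DBP's synchronization addresses does not arise here), and a Python thread stands in for an asynchronous lookup engine.

```python
from concurrent.futures import ThreadPoolExecutor


def lookup(batch_keys, table):
    # Hypothetical stand-in for an embedding lookup; a real system would
    # fetch rows from a sharded or host-side embedding table.
    return [table[k] for k in batch_keys]


def dense_step(embeddings):
    # Hypothetical stand-in for the dense forward/backward computation.
    return sum(embeddings)


def train(batches, table):
    results = []
    if not batches:
        return results
    with ThreadPoolExecutor(max_workers=1) as pool:
        # Fill the first buffer before entering the loop.
        pending = pool.submit(lookup, batches[0], table)
        for i in range(len(batches)):
            # Buffer A: wait until the current batch's embeddings are ready.
            current = pending.result()
            if i + 1 < len(batches):
                # Buffer B: start the next batch's lookup so it runs
                # while the current batch is being computed.
                pending = pool.submit(lookup, batches[i + 1], table)
            results.append(dense_step(current))
    return results


if __name__ == "__main__":
    table = {1: 1.0, 2: 2.0, 3: 3.0}
    print(train([[1, 2], [3], [1, 3]], table))
```

In actual training the embeddings are updated after every step, so a prefetched buffer can hold outdated rows; per the abstract, DBP's dual-buffer synchronization is what removes that staleness, and the sketch does not attempt to model it.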

Distributed Systems & Hardware | Recommendation & Information Retrieval | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
