Apr 23, 2026arXiv:2604.21275

Optimizing High-Throughput Distributed Data Pipelines for Reproducible Deep Learning at Scale

Kashish Mittal, Di Yu, Roozbeh Ketabi, A. Arora, Brendon Lapp, Peng Zhang

AI Summary

This paper tackles data loading bottlenecks in distributed deep learning training pipelines using Petastorm and Parquet datasets. They identify network I/O and CPU-bound transformations as key limitations, causing low GPU utilization. By implementing push-down worker-level transformations, local-disk caching with Fanout-Cache, and deterministic queue management, they achieve a 6x speedup and significantly improved GPU utilization, enabling reproducible large-scale training.

Key Contribution

Data loading bottlenecks can strangle your GPU utilization down to 10%, but a few smart optimizations can unlock a 6x speedup.

Abstract

Training massive-scale deep learning models on datasets spanning tens of terabytes presents critical challenges in hardware utilization and training reproducibility. In this paper, we identify and resolve profound data-loading bottlenecks within distributed GPU training pipelines using the Petastorm data loader and Apache Parquet datasets. Through systematic profiling, we demonstrate that network I/O and CPU-bound data transformations (e.g., PyArrow to NumPy) constrain GPU utilization to as low as 10-15%. To address this, we propose an optimized architecture that features push-down worker-level transformations coupled with local-disk caching via Fanout-Cache, minimizing redundant I/O and CPU overhead across training epochs. Furthermore, we eliminate race conditions in multi-worker shared queues by implementing dedicated round-robin ventilator and result queues, alongside modernized RNG handling, achieving strict deterministic data loading. Our optimizations yield a 6x speedup, reducing end-to-end training time from 22 hours to 3 hours, increasing GPU utilization to over 60%, and drastically reducing run-to-run variance, enabling robust, high-throughput, and reproducible large-scale model training.

Data Curation & Synthetic Data Distributed Systems & Hardware Training Efficiency & Optimization

Citation Metrics

Citations0

Influential citations0

References8

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Optimizing High-Throughput Distributed Data Pipelines for Reproducible Deep Learning at Scale

Related Papers