The paper introduces FedQueue, a federated learning protocol designed to mitigate the impact of stochastic queue delays in cross-facility HPC training environments. FedQueue combines online queue delay prediction, cutoff-based admission control, and staleness-aware aggregation to improve convergence and robustness. In a real-world cross-facility deployment, FedQueue improves on baseline FL algorithms by 20.5%; in controlled queue simulations, it reduces time to target accuracy by about 34% under high queue variance and non-IID data.
FedQueue tackles the Achilles' heel of federated learning on HPC clusters, unpredictable queue delays, by explicitly modeling those delays and mitigating their impact on training time.
Federated learning (FL) across multiple HPC facilities faces stochastic admission delays from batch schedulers, and these delays dominate wall-clock time. Synchronous FL suffers from severe stragglers, while asynchronous FL accumulates stale updates when queues spike. We propose FedQueue, a queue-aware FL protocol that incorporates scheduler delays directly into training and aggregation. FedQueue (i) predicts per-facility queue delays online to budget local work, (ii) applies cutoff-based admission that buffers late arrivals to bound staleness, and (iii) performs staleness-aware aggregation to stabilize heterogeneous local workloads. We prove convergence for non-convex objectives at rate $\mathcal{O}(1/\sqrt{R})$ under bounded staleness, and show that the admission controls yield bounded staleness with high probability under queue-prediction error. A real-world cross-facility deployment of FedQueue shows a 20.5% improvement over baseline algorithms. Controlled queue simulations demonstrate robust improvement over the baselines, in particular a reduction of about 34% in the time to reach a target accuracy level under high queue variance and non-IID partitions.
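The three mechanisms named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the predictor, the cutoff rule, or the aggregation weights, so the exponential moving average, the fixed per-round cutoff, and the `1 / (1 + staleness)` weighting below are all assumptions chosen for illustration, as are the names `QueueDelayPredictor`, `admit`, and `aggregate`.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Update:
    facility: str
    base_round: int       # global round whose model the facility trained from
    arrival_time: float   # seconds after the current round opened
    delta: List[float]    # flattened model update


class QueueDelayPredictor:
    """Online queue-delay estimate for one facility.

    Assumption: an exponential moving average stands in for whatever
    predictor the paper actually uses.
    """

    def __init__(self, alpha: float = 0.3, initial: float = 60.0):
        self.alpha = alpha
        self.estimate = initial

    def observe(self, delay: float) -> None:
        # Blend the newly observed delay into the running estimate.
        self.estimate = self.alpha * delay + (1 - self.alpha) * self.estimate

    def local_steps(self, round_budget: float, step_time: float) -> int:
        # Budget local work from the time left after the predicted queue wait.
        remaining = max(round_budget - self.estimate, 0.0)
        return max(int(remaining // step_time), 1)


def admit(updates: List[Update], cutoff: float) -> Tuple[List[Update], List[Update]]:
    """Cutoff-based admission: on-time updates are aggregated now,
    late ones are buffered for a later round (hypothetical fixed cutoff)."""
    on_time = [u for u in updates if u.arrival_time <= cutoff]
    late = [u for u in updates if u.arrival_time > cutoff]
    return on_time, late


def aggregate(model: List[float], updates: List[Update],
              current_round: int, max_staleness: int) -> List[float]:
    """Staleness-aware aggregation with assumed weights 1 / (1 + staleness).

    Updates older than max_staleness rounds are dropped.
    """
    weighted = []
    for u in updates:
        staleness = current_round - u.base_round
        if staleness > max_staleness:
            continue
        weighted.append((1.0 / (1.0 + staleness), u.delta))
    if not weighted:
        return list(model)
    total = sum(w for w, _ in weighted)
    new_model = list(model)
    for w, delta in weighted:
        for i, d in enumerate(delta):
            new_model[i] += (w / total) * d
    return new_model
```

In this sketch, a server loop would call `local_steps` before dispatching work to each facility, call `admit` when the round's cutoff expires, pass the admitted updates together with any previously buffered ones to `aggregate`, and carry the late list forward to the next round.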