DAMO, Bairong Inc., School of Information Science and Technology

Feb 26, 2026

arXiv:2602.22580

FuxiShuffle: An Adaptive and Resilient Shuffle Service for Distributed Data Processing on Alibaba Cloud

Yuhao Lin, Zhipeng Tang, Jia Tong, Junqing Xiao, Binhan Lu, Yuhang Li, Chao Li, Zhi-guo Zhang, Junhua Wang, Hao Luo, James Cheng, Chuang Hu, Xiaodan Yan

AI Summary

The paper introduces FuxiShuffle, a shuffle service designed for Alibaba Cloud's MaxCompute, addressing the limitations of existing systems in adapting to dynamic job characteristics and providing efficient failure resilience. FuxiShuffle achieves adaptability through dynamic shuffle mode selection, progress-aware scheduling, and adaptive backup strategies, while ensuring resilience via multi-replica failover, careful memory management, and incremental recovery. Experimental results demonstrate that FuxiShuffle reduces job completion time and resource consumption compared to baseline systems.

Key Contribution

Alibaba's FuxiShuffle adapts at runtime to workload and resource fluctuations in ultra-large distributed data processing, reducing both end-to-end job completion time and aggregate resource consumption relative to baseline systems.

Abstract

Shuffle exchanges intermediate results between upstream and downstream operators in distributed data processing and is usually the bottleneck due to factors such as small random I/Os and network contention. Several systems have been designed to improve shuffle efficiency, but from our experience running ultra-large clusters on the Alibaba Cloud MaxCompute platform, we observe that they cannot adapt to highly dynamic job characteristics and cluster resource conditions, and that their fault tolerance mechanisms are passive and inefficient when failures are inevitable. To address these limitations, we design and implement FuxiShuffle as a general data shuffle service for the ultra-large production environment of MaxCompute, featuring good adaptability and efficient failure resilience. Specifically, to achieve good adaptability, FuxiShuffle dynamically selects the shuffle mode based on runtime information, conducts progress-aware scheduling for the downstream workers, and automatically determines the most suitable backup strategy for each shuffle data chunk. To make failure resilience efficient, FuxiShuffle actively ensures data availability with multi-replica failover, prevents memory overflow with careful memory management, and employs an incremental recovery mechanism that does not lose computation progress. Our experiments show that, compared to baseline systems, FuxiShuffle significantly reduces not only end-to-end job completion time but also aggregate resource consumption. Micro experiments suggest that our designs are effective in improving adaptability and failure resilience.

Distributed Systems & Hardware

Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 54

Year: 2026

Venue: N/A
