This paper investigates the impact of malleable job scheduling on HPC cluster performance using real-world workload traces from Cori, Eagle, and Theta supercomputers. The authors simulate varying proportions of malleable jobs and evaluate five scheduling strategies, including a novel strategy that prioritizes maintaining malleable jobs at their preferred resource allocation. Results demonstrate that malleable jobs significantly improve job turnaround times, makespan, wait times, and node utilization compared to rigid workloads, even with only 20% of jobs being malleable.
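Letting the scheduler adjust the resource allocation of malleable jobs at runtime cuts job wait times on HPC clusters by 73-99% relative to fully rigid workloads under the best-performing strategy for each system, and gains remain substantial even when only 20% of jobs are malleable.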
Optimizing resource utilization in high-performance computing (HPC) clusters is essential for maximizing both system efficiency and user satisfaction. However, traditional rigid job scheduling often results in underutilized resources and increased job waiting times. This work evaluates the benefits of resource elasticity, where the job scheduler dynamically adjusts the resource allocation of malleable jobs at runtime. Using real workload traces from the Cori, Eagle, and Theta supercomputers, we simulate varying proportions (0-100%) of malleable jobs with the ElastiSim software. We evaluate five job scheduling strategies, including a novel one that maintains malleable jobs at their preferred resource allocation when possible. Results show that, compared to fully rigid workloads, malleable jobs yield significant improvements across all key metrics. Considering the best-performing scheduling strategy for each supercomputer, job turnaround times decrease by 37-67%, job makespan by 16-65%, job wait times by 73-99%, and node utilization improves by 5-52%. Although improvements vary, gains remain substantial even at 20% malleable jobs. This work highlights important correlations between workload characteristics (e.g., job runtimes and node requirements), malleability proportions, and scheduling strategies. These findings confirm the potential of malleability to address inefficiencies in current HPC practices and demonstrate that even limited adoption can provide substantial advantages, encouraging its integration into HPC resource management.
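To make the idea of a "preferred resource allocation" concrete, the following is a minimal sketch of how a scheduler might size malleable jobs. It is not the strategy evaluated in the paper and does not use the ElastiSim API. The `MalleableJob` fields (`min_nodes`, `preferred_nodes`, `max_nodes`), the `allocate` function, and the three-pass order are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class MalleableJob:
    # Hypothetical job description; field names are illustrative only.
    name: str
    min_nodes: int
    preferred_nodes: int
    max_nodes: int


def allocate(jobs, total_nodes):
    """Return {job name: node count} for the jobs that fit, in queue order."""
    allocation = {}
    free = total_nodes

    # Pass 1: admit jobs in queue order at their minimum size.
    for job in jobs:
        if job.min_nodes <= free:
            allocation[job.name] = job.min_nodes
            free -= job.min_nodes

    # Pass 2: grow admitted jobs toward their preferred size.
    for job in jobs:
        if job.name in allocation:
            extra = min(job.preferred_nodes - allocation[job.name], free)
            allocation[job.name] += extra
            free -= extra

    # Pass 3 (assumed policy): hand leftover nodes out up to each job's maximum.
    for job in jobs:
        if job.name in allocation:
            extra = min(job.max_nodes - allocation[job.name], free)
            allocation[job.name] += extra
            free -= extra

    return allocation


queue = [
    MalleableJob("A", min_nodes=2, preferred_nodes=4, max_nodes=8),
    MalleableJob("B", min_nodes=1, preferred_nodes=2, max_nodes=4),
    MalleableJob("C", min_nodes=4, preferred_nodes=6, max_nodes=8),
]
print(allocate(queue, total_nodes=10))  # {'A': 4, 'B': 2, 'C': 4}
```

With 10 nodes, all three jobs start, A and B reach their preferred sizes, and C stays at its minimum. A rigid scheduler that required the preferred sizes up front would need 12 nodes, so one job would have to wait. A real scheduler would also have to account for running jobs, reconfiguration cost, and job priorities, which this sketch ignores.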