This paper introduces a methodology for validating Dynamic Resource Management (DRM) techniques, such as MPI malleability, by replaying real-world HPC workload logs on a target cluster. The methodology adapts the workload to the target cluster, enabling realistic scenario testing. Experiments on a 125-node partition of MareNostrum 5 showed that parallel efficiency-aware malleability reduced a malleable workload's runtime by 27% without delaying the baseline workload, despite introducing queueing delays for individual jobs.
MPI malleability can cut HPC workload times by over 25% in real-world conditions, but only if you account for parallel efficiency.
Dynamic Resource Management (DRM) techniques can be leveraged to maximize throughput and resource utilization in computational clusters. Although DRM has been extensively studied through analytical workloads and simulations, skepticism persists among administrators and end users regarding its feasibility under real-world conditions. To address this problem, we propose a novel methodology for validating DRM techniques, such as malleability, in realistic scenarios that reproduce the actual job and user conditions of a cluster by replaying workload logs on a High-Performance Computing (HPC) infrastructure. Our methodology is capable of adapting the workload to the target cluster. We evaluate it on a malleability-enabled 125-node partition of the MareNostrum 5 supercomputer. Our results validate the proposed method and assess the benefits of MPI malleability in a novel use case involving a pioneer user of malleability (our "PhD Student"): parallel efficiency-aware malleability reduced the malleable workload's time by 27% without delaying the baseline workload and while maintaining the resource utilization rate, although it introduced queueing delays for individual jobs.
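As a rough illustration of what "parallel efficiency-aware" can mean, the sketch below computes the standard parallel efficiency (speedup divided by the increase in node count) and picks the largest job size that stays above a threshold. The function names, the 0.8 threshold, and the runtimes are assumptions made for this example; they are not taken from the paper, and the selection rule is not claimed to be the authors' policy.

```python
def parallel_efficiency(t_base, n_base, t_n, n):
    """Efficiency on n nodes relative to a baseline run on n_base nodes."""
    speedup = t_base / t_n
    return speedup / (n / n_base)


def largest_efficient_size(runtimes, threshold=0.8):
    """Return the largest node count whose efficiency is at least `threshold`.

    `runtimes` maps node count -> measured runtime in seconds.
    The smallest node count is used as the baseline (assumption).
    """
    n_base = min(runtimes)
    t_base = runtimes[n_base]
    efficient = [
        n for n, t in runtimes.items()
        if parallel_efficiency(t_base, n_base, t, n) >= threshold
    ]
    return max(efficient)


# Hypothetical strong-scaling measurements (nodes -> seconds), illustration only.
runtimes = {1: 1000.0, 2: 520.0, 4: 290.0, 8: 200.0}
print(largest_efficient_size(runtimes))  # prints 4
```

With these made-up numbers, growing the job from 4 to 8 nodes drops efficiency to 0.625, so an efficiency-aware policy of this kind would stop at 4 nodes and leave the remaining nodes for other jobs.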