Toulouse INP | Apr 29, 2026 | arXiv:2604.26576

MPI Malleability Validation under Replayed Real-World HPC Conditions

S. Iserte, M. Madon, G. Da, J. Pierson, A. J. Peña

AI Summary

This paper introduces a methodology for validating Dynamic Resource Management (DRM) techniques, such as MPI malleability, by replaying real-world HPC workload logs on a target cluster. The methodology adapts the workload to the target cluster so that realistic scenarios can be tested. In experiments on a 125-node partition of MareNostrum 5, parallel efficiency-aware malleability reduced a malleable workload's time by 27% without delaying the baseline workload, although it introduced queueing delays for individual jobs.

Key Contribution

MPI malleability can cut HPC workload times by over 25% in real-world conditions, but only if you account for parallel efficiency.

Abstract

Dynamic Resource Management (DRM) techniques can be leveraged to maximize throughput and resource utilization in computational clusters. Although DRM has been extensively studied through analytical workloads and simulations, skepticism persists among administrators and end users regarding its feasibility under real-world conditions. To address this problem, we propose a novel methodology for validating DRM techniques, such as malleability, in realistic scenarios that reproduce the actual job and user conditions of a cluster by replaying workload logs on a High-Performance Computing (HPC) infrastructure. Our methodology is capable of adapting the workload to the target cluster. We evaluate our methodology in a malleability-enabled 125-node partition of the MareNostrum 5 supercomputer. Our results validate the proposed method and assess the benefits of MPI malleability on a novel use case of a pioneer user of malleability (our "PhD Student"): parallel efficiency-aware malleability reduced the malleable workload's time by 27% without delaying the baseline workload. It introduced queueing delays for individual jobs but maintained the resource utilization rate.
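To make the term "parallel efficiency-aware" concrete, the sketch below uses the standard definition of parallel efficiency (speedup divided by the relative increase in nodes) and picks the largest node count that stays above an efficiency threshold. The function names, the 0.8 threshold, and the runtimes are hypothetical illustrations; the abstract does not describe the paper's actual resizing policy, so this should not be read as the authors' method.

```python
def parallel_efficiency(t_ref, n_ref, t_n, n):
    """Efficiency of a run on n nodes relative to a reference run on n_ref nodes.

    Standard definition: speedup divided by the factor by which nodes increased.
    """
    speedup = t_ref / t_n
    return speedup / (n / n_ref)


def largest_efficient_size(runtimes, threshold=0.8):
    """Return the largest node count whose efficiency meets the threshold.

    runtimes: dict mapping node count -> measured runtime in seconds.
    threshold: hypothetical cutoff, chosen only for illustration.
    """
    n_ref = min(runtimes)          # smallest measured size is the reference
    t_ref = runtimes[n_ref]
    efficient = [
        n for n, t in runtimes.items()
        if parallel_efficiency(t_ref, n_ref, t, n) >= threshold
    ]
    return max(efficient)          # the reference itself always qualifies


# Made-up runtimes (seconds) for one job at several node counts.
runtimes = {1: 1000.0, 2: 520.0, 4: 290.0, 8: 200.0}
best = largest_efficient_size(runtimes)
print(best)
```

With these made-up numbers, the efficiency at 8 nodes is 0.625, so a scheduler following this rule would stop expanding the job at 4 nodes and leave the remaining nodes for other work.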

Distributed Systems & Hardware

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
