This paper introduces a unified API and an enhanced DMR framework for managing dynamic resources in HPC, enabling seamless integration of diverse reconfiguration methods. The framework builds on DMRlib and incorporates the Proteo reconfiguration engine, allowing more flexible malleability strategies without fully respawning processes. An evaluation with the MPDATA solver illustrates the solution's performance and coding productivity.
Unlock HPC application malleability without the headache of process respawning thanks to this unified dynamic resource management API.
This paper presents an efficient tool for managing dynamic resources in production high-performance computing (HPC) settings, focusing on flexibility, adaptability, and user-friendliness. We introduce a unified dynamic resource management application programming interface (API) that supports a wide range of HPC applications, allowing them to be integrated seamlessly without interacting directly with the Dynamic Management of Resources (DMR) framework. The DMR framework, which evolved from the DMRlib structure, now supports various dynamic resource managers and includes the Proteo reconfiguration engine to enhance malleability strategies. This integration overcomes previous limitations, namely the need to respawn all processes and the lack of resource management system (RMS) support, by enabling diverse reconfiguration methods. The paper also evaluates the solution's performance and coding productivity using the MPDATA (Multidimensional Positive Definite Advection Transport Algorithm) application. Key contributions include: an enhanced, modular DMR framework that supports different reconfiguration managers; an upgraded DMRlib that integrates the Proteo reconfiguration engine and offers an extensive set of reconfiguration strategies; and a malleable version of the MPDATA solver.
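To make the idea of malleability more concrete, the following is a minimal, hypothetical sketch of the general pattern: an iterative solver reaches a reconfiguration point, asks a resource manager whether to expand or shrink, and redistributes its domain across the new process count. It does not use or reproduce the DMR, DMRlib, or Proteo APIs. Processes are simulated as Python lists instead of MPI ranks, the "compute" phase is a trivial update rather than MPDATA, and `hypothetical_rms_decision` with its fixed schedule is an invented stand-in for a real resource manager's decisions.

```python
def block_partition(n: int, p: int) -> list[tuple[int, int]]:
    """Split n cells over p ranks as evenly as possible; returns (start, end) per rank."""
    base, extra = divmod(n, p)
    bounds, start = [], 0
    for rank in range(p):
        size = base + (1 if rank < extra else 0)  # first `extra` ranks get one more cell
        bounds.append((start, start + size))
        start += size
    return bounds


def redistribute(blocks: list[list[float]], new_p: int) -> list[list[float]]:
    """Gather the global domain from the current blocks and re-split it over new_p ranks."""
    flat = [x for block in blocks for x in block]
    return [flat[s:e] for s, e in block_partition(len(flat), new_p)]


def hypothetical_rms_decision(step: int, p: int) -> int:
    """Hypothetical resource-manager stand-in: returns the process count to use next."""
    schedule = {2: 4, 4: 3}  # made-up policy: expand to 4 at step 2, shrink to 3 at step 4
    return schedule.get(step, p)


def run(n_cells: int = 10, p: int = 2, steps: int = 6):
    """Run a toy malleable loop; returns final blocks and the process count per step."""
    blocks = redistribute([[0.0] * n_cells], p)
    history = []
    for step in range(steps):
        # Compute phase: each simulated rank updates only its local block.
        blocks = [[x + 1.0 for x in block] for block in blocks]
        # Reconfiguration point: query the (hypothetical) manager and resize if asked.
        new_p = hypothetical_rms_decision(step, len(blocks))
        if new_p != len(blocks):
            blocks = redistribute(blocks, new_p)
        history.append(len(blocks))
    return blocks, history


if __name__ == "__main__":
    final_blocks, counts = run()
    print(counts)
    print([len(b) for b in final_blocks])
```

The sketch gathers and re-splits the whole domain on every resize for simplicity. A production engine would typically move only the cells whose owner changes and, as the abstract notes for Proteo, would aim to avoid respawning all processes.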