This paper analyzes GPU sharing via Multi-Instance GPU (MIG) across diverse HPC, AI, and data analytics workloads. It finds that MIG improves resource utilization, but performance interference persists through shared resources, for example in the form of power throttling. To address the mismatch between fixed MIG slices and application needs, the authors propose a memory-offloading scheme that uses the NVLink-C2C interconnect. Experiments with NekRS, LAMMPS, Llama3, and Qiskit demonstrate the effectiveness of the proposed offloading scheme in reducing resource underutilization.
Static GPU partitioning alone can't solve underutilization, but fine-grained offloading to CPU memory over NVLink-C2C can bridge the gap.
Advances in GPU compute throughput and memory capacity bring significant opportunities to a wide range of workloads. However, efficiently utilizing these resources remains challenging, particularly because diverse application characteristics may result in imbalanced utilization. Multi-Instance GPU (MIG) is a promising approach to improve utilization by partitioning GPU compute and memory resources into fixed-size slices with isolation. Yet, its effectiveness and limitations in supporting HPC workloads remain an open question. We present a comprehensive system-level characterization of different GPU sharing options using real-world scientific, AI, and data analytics applications, including NekRS, LAMMPS, Llama3, and Qiskit. Our analysis reveals that while GPU sharing via MIG can significantly reduce resource underutilization and enable system-level improvements in throughput and energy, interference still occurs through shared resources, for example in the form of power throttling. Our performance-resource scaling results indicate that coarse-grained provisioning for tightly coupled compute and memory resources often mismatches application needs. To address this mismatch, we propose a memory-offloading scheme that leverages the cache-coherent NVLink-C2C interconnect to bridge the gap between coarse-grained resource slices and reduce resource underutilization.
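The mismatch described in the abstract can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the authors' scheme: the slice menu below (`SLICES`), the helper names `pick_slice` and `stranded_compute`, and all sizes are hypothetical and do not correspond to real MIG profiles. The model also ignores the performance cost of accessing offloaded memory over the interconnect. It only shows how, when compute and memory grow together in fixed steps, a memory-heavy job is forced into a larger slice than its compute needs justify, and how moving part of its memory off the GPU lets it fit a smaller one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Slice:
    name: str
    compute_units: int  # hypothetical number of compute slices
    memory_gb: float    # hypothetical GPU memory attached to the slice


# Hypothetical menu: compute and memory are tightly coupled and grow together.
SLICES = [
    Slice("small", 1, 12.0),
    Slice("medium", 2, 24.0),
    Slice("large", 4, 48.0),
    Slice("full", 7, 96.0),
]


def pick_slice(compute_needed: int, memory_needed_gb: float,
               offload_gb: float = 0.0) -> Optional[Slice]:
    """Return the smallest slice that fits the job, or None if none fits.

    offload_gb is the portion of the working set assumed to live outside
    GPU memory (for example in CPU memory), so it does not count against
    the slice's memory capacity.
    """
    resident_gb = max(memory_needed_gb - offload_gb, 0.0)
    for s in SLICES:  # SLICES is ordered from smallest to largest
        if s.compute_units >= compute_needed and s.memory_gb >= resident_gb:
            return s
    return None


def stranded_compute(s: Slice, compute_needed: int) -> int:
    """Compute units allocated to the job but not needed by it."""
    return s.compute_units - compute_needed


if __name__ == "__main__":
    # A job that needs little compute but 30 GB of memory.
    without = pick_slice(compute_needed=1, memory_needed_gb=30.0)
    with_offload = pick_slice(compute_needed=1, memory_needed_gb=30.0,
                              offload_gb=18.0)
    print(without.name, stranded_compute(without, 1))
    print(with_offload.name, stranded_compute(with_offload, 1))
```

In this toy setting, the job is pushed to the "large" slice purely by its memory footprint, leaving three compute units idle; with 18 GB offloaded it fits the "small" slice with none stranded. Whether such a trade is worthwhile in practice depends on how sensitive the application is to the bandwidth and latency of the offloaded memory, which this sketch does not model.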