Apr 9, 2026 | arXiv:2604.08451

Taming GPU Underutilization via Static Partitioning and Fine-grained CPU Offloading

Gabin Schieffer, Ruimin Shi, Jie Ren, Ivy Peng

AI Summary

This paper analyzes GPU sharing via Multi-Instance GPU (MIG) across diverse HPC, AI, and data analytics workloads. It finds that MIG improves resource utilization, but that performance interference persists through shared resources, for example in the form of power throttling. To address the mismatch between fixed MIG slices and application needs, the authors propose a memory-offloading scheme that uses the cache-coherent NVLink-C2C interconnect. Experiments with NekRS, LAMMPS, Llama3, and Qiskit demonstrate the effectiveness of the proposed offloading scheme in reducing resource underutilization.

Key Contribution

Static GPU partitioning alone can't solve underutilization, but fine-grained CPU offloading over NVLink-C2C can bridge the gap between fixed slice sizes and application needs.

Abstract

Advances in GPU compute throughput and memory capacity bring significant opportunities to a wide range of workloads. However, efficiently utilizing these resources remains challenging, particularly because diverse application characteristics may result in imbalanced utilization. Multi-Instance GPU (MIG) is a promising approach to improve utilization by partitioning GPU compute and memory resources into fixed-size slices with isolation. Yet its effectiveness and limitations in supporting HPC workloads remain an open question. We present a comprehensive system-level characterization of different GPU sharing options using real-world scientific, AI, and data analytics applications, including NekRS, LAMMPS, Llama3, and Qiskit. Our analysis reveals that while GPU sharing via MIG can significantly reduce resource underutilization and enable system-level improvements in throughput and energy, interference still occurs through shared resources, for example in the form of power throttling. Our performance-resource scaling results indicate that coarse-grained provisioning of tightly coupled compute and memory resources often mismatches application needs. To address this mismatch, we propose a memory-offloading scheme that leverages the cache-coherent NVLink-C2C interconnect to bridge the gap between coarse-grained resource slices and reduce resource underutilization.
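
The mismatch described in the abstract can be illustrated with a small sketch. It is not the authors' scheme: the slice table, its sizes, and the function names `smallest_fit` and `fit_with_offload` are hypothetical and chosen only for illustration. The sketch compares picking a slice that satisfies both compute and memory needs against picking a slice by compute need alone and spilling the remaining memory to CPU memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slice:
    name: str
    compute_units: int  # compute slices granted to the instance
    memory_gb: int      # GPU memory attached to the instance


# Hypothetical slice table; the sizes are illustrative, not from the paper.
SLICES = [
    Slice("1g", 1, 12),
    Slice("2g", 2, 24),
    Slice("3g", 3, 48),
    Slice("4g", 4, 48),
    Slice("7g", 7, 96),
]


def smallest_fit(compute_needed, memory_needed_gb):
    """Smallest slice that covers both needs with no offloading."""
    for s in SLICES:
        if s.compute_units >= compute_needed and s.memory_gb >= memory_needed_gb:
            return s
    return None


def fit_with_offload(compute_needed, memory_needed_gb):
    """Size the slice by compute only; return (slice, GB spilled to CPU)."""
    for s in SLICES:
        if s.compute_units >= compute_needed:
            return s, max(0, memory_needed_gb - s.memory_gb)
    return None, 0


# An application needing 2 compute units and 40 GB of memory:
print(smallest_fit(2, 40))      # forced up to the 3g slice by memory alone
print(fit_with_offload(2, 40))  # stays on 2g, with 16 GB placed in CPU memory
```

In this toy setting, coupling compute and memory forces the application onto a larger slice than its compute demand justifies, while offloading lets it stay on the smaller slice at the cost of accessing part of its data over the CPU-GPU interconnect.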

Distributed Systems & Hardware | Inference & Quantization | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 31

Year: 2026

Venue: N/A
