This paper addresses the lack of fault resilience in NVIDIA's Multi-Process Service (MPS), which limits its use in multi-tenant GPU clusters. Through systematic fault characterization and analysis of the GPU fault-processing pipeline, the authors identify memory-related faults as dominant and amenable to software isolation. They develop a fault isolation mechanism for these faults and, for other fault types, a fast recovery mechanism that shares GPU-resident state through virtual memory. The evaluation reports that both mechanisms handle their target faults effectively with minimal overhead.
A fault in one GPU process no longer needs to crash them all: this paper introduces mechanisms for fault-resilient NVIDIA MPS, enabling more robust multi-tenant GPU clusters.
NVIDIA Multi-Process Service (MPS) enables fine-grained GPU sharing by allowing multiple processes to execute concurrently on the same GPU, making it an important mechanism for improving GPU utilization. However, MPS has weak fault resilience: a fault in one process can terminate all co-running processes, limiting its adoption in resilience-critical settings such as multi-tenant GPU clusters. In this work, we design fault-resilient MPS to solve this problem. Our design is guided by insights from a systematic characterization of GPU faults and a deep analysis of their end-to-end processing pipeline. Based on these insights, we design two complementary mechanisms. The first is a fault isolation mechanism for the dominant memory-related faults, which can be fully isolated by software intervention in the open GPU driver kernel module. The second targets the remaining faults, whose processing takes place within proprietary software: for these, we design a practical mechanism, fast recovery using virtual-memory-based sharing of GPU-resident state. Our evaluation on different GPUs and workloads shows that these mechanisms handle their corresponding faults effectively with minimal overhead.