May 26, 2026

arXiv:2605.26461

Characterization-Guided GPU Fault Resilience in NVIDIA MPS

Rixin Liu, Xingqi Cui, Kaijian Wang, Xinheng Ding, Zirui Liu, Yuke Wang, Jiarong Xing

AI Summary

This paper addresses the lack of fault resilience in NVIDIA's Multi-Process Service (MPS), which limits its use in multi-tenant GPU clusters. Through systematic fault characterization and analysis of the GPU processing pipeline, the authors identify memory-related faults as dominant and amenable to software isolation. They then develop a fault isolation mechanism for these faults, coupled with a fast recovery mechanism using virtual memory-based GPU-resident state sharing for other fault types, achieving effective fault handling with minimal overhead.

Key Contribution

A fault in one GPU process no longer needs to crash them all: this paper introduces mechanisms for fault-resilient NVIDIA MPS, enabling more robust multi-tenant GPU clusters.

Abstract

NVIDIA Multi-Process Service (MPS) enables fine-grained GPU sharing by allowing multiple processes to execute concurrently on the same GPU, making it an important mechanism for improving GPU utilization. However, MPS has weak fault resilience: a fault in one process can terminate all co-running processes, limiting its adoption in resilience-critical settings such as multi-tenant GPU clusters. In this work, we design fault-resilient MPS to solve this problem. Our design is guided by insights from a systematic characterization of GPU faults and a deep analysis of their end-to-end processing pipeline. Based on these insights, we design two complementary mechanisms. The first is a fault isolation mechanism for the dominant memory-related faults, which can be fully isolated by software intervention in the open GPU driver kernel module. For other faults, whose processing takes place within proprietary software, we design a practical alternative: fast recovery using virtual-memory-based GPU-resident state sharing. Our evaluation on different GPUs and workloads shows that these mechanisms handle their corresponding faults effectively with minimal overhead.

Distributed Systems & Hardware

Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 39

Year: 2026

Venue: N/A
