NVIDIA | May 5, 2026 | arXiv:2605.04213

The Anatomy of Silent Data Corruption: GPU Error Pattern Study and Modeling Guidance

Chung-Hsuan Tung, Yanxiang Huang, N. Saxena, Philip P. Shirvani, Saurabh Hukerikar, Twinkle Jain, Abhishek Tyagi, Sanjay Gongalore

AI Summary

This paper performs a large-scale gate-level stuck-at fault injection study on a production-class data-center GPU, using over 3 million simulator hours across 63 CUDA micro-benchmarks, to characterize silent data corruption (SDC) patterns. The study finds that NaNs and infinities are rare SDC outcomes (1.01%), that single-bit flips make up less than 40% of bit-flip events, and that corruption addresses exhibit periodicity. These findings challenge common assumptions about SDC and motivate more realistic, distribution-aware fault modeling techniques.

Key Contribution

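NaNs and single-bit flips are not the dominant outcomes of GPU silent data corruption: NaN/+INF/-INF make up only 1.01% of SDC outcomes, and single-bit flips account for less than 40% of bit-flip events. Fault models should therefore follow the measured distributions rather than these common assumptions.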

Abstract

Silent data corruption (SDC) threatens the reliability of large-scale GPU clusters used for training large language models, yet its rarity and lack of explicit error signals make accurate high-level modeling challenging. To address this gap, we conducted a large-scale gate-level stuck-at fault injection on a production-class data-center GPU, consuming over three million simulator hours across 63 CUDA micro-benchmarks. We extracted GPU SDC characteristics in terms of corruption types, bit-flip behavior, and warp-aligned spatial correlation. Our results show that NaN/+INF/-INF account for only 1.01% of SDC outcomes, that single-bit flips constitute less than 40% of bit-flip events, and that corruption addresses exhibit periodicity. These statistics motivate distribution-aware high-level fault modeling and realistic software-based fault injection for resilience evaluation of production-class GPU architectures.
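
As a rough illustration of what distribution-aware software-based fault injection could look like, the sketch below corrupts float32 values using the two figures quoted in the abstract. It is not the authors' method. Several parts are assumptions made only for illustration: the single-bit probability is set to the 40% upper bound because the abstract gives no exact value, the multi-bit flip counts are hypothetical, and the `start`/`stride` parameters are a hypothetical stand-in for the reported address periodicity. Note also that a random bit flip can itself produce a NaN or infinity, so this simple split does not reproduce the 1.01% figure exactly.

```python
import math
import random
import struct

# Figure taken from the abstract: NaN/+INF/-INF share of SDC outcomes.
P_SPECIAL = 0.0101
# Assumption: the abstract only states "less than 40%"; 0.40 is a placeholder.
P_SINGLE_BIT = 0.40
# Assumption: hypothetical multi-bit flip counts, chosen uniformly.
MULTI_BIT_COUNTS = (2, 3, 4)


def flip_bits(value, bit_positions):
    """Flip the given bit positions (0..31) in the float32 encoding of value."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    for pos in bit_positions:
        raw ^= 1 << pos
    (corrupted,) = struct.unpack("<f", struct.pack("<I", raw))
    return corrupted


def corrupt_value(value, rng):
    """Return a corrupted copy of value, sampled from the assumed distribution."""
    if rng.random() < P_SPECIAL:
        return rng.choice((math.nan, math.inf, -math.inf))
    if rng.random() < P_SINGLE_BIT:
        n_bits = 1
    else:
        n_bits = rng.choice(MULTI_BIT_COUNTS)
    # Distinct positions, so at least one bit always changes.
    return flip_bits(value, rng.sample(range(32), n_bits))


def inject(data, start, stride, count, rng):
    """Corrupt up to `count` elements at start, start + stride, ... (hypothetical periodic pattern)."""
    out = list(data)
    touched = []
    for k in range(count):
        idx = start + k * stride
        if idx >= len(out):
            break
        out[idx] = corrupt_value(out[idx], rng)
        touched.append(idx)
    return out, touched


if __name__ == "__main__":
    rng = random.Random(0)
    corrupted, touched = inject([1.0] * 32, start=2, stride=8, count=4, rng=rng)
    print("corrupted indices:", touched)
    print("corrupted values:", [corrupted[i] for i in touched])
```

Replacing the placeholder constants with measured distributions (for example, a full histogram of flipped-bit counts) would be the natural next step; the abstract alone does not provide those values.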

Architecture Design (Transformers, SSMs, MoE) | Distributed Systems & Hardware

Citation Metrics

Citations: 0

Influential citations: 0

References: 43

Year: 2026

Venue: N/A
