This paper reports a large-scale gate-level stuck-at fault injection study on a production-class data-center GPU, using over 3 million simulator hours across 63 CUDA micro-benchmarks to characterize silent data corruption (SDC) patterns. The study finds that NaNs and infinities are rare SDC outcomes (1.01%), that single-bit flips make up less than 40% of bit-flip events, and that corruption addresses exhibit periodicity. These findings challenge common assumptions about SDC and motivate more realistic, distribution-aware fault modeling techniques.
NaNs and single-bit flips are often assumed to be the main signatures of GPU silent data corruption. This study finds that NaN/INF outcomes are rare (1.01%) and that single-bit flips are a minority of bit-flip events (under 40%), which calls for a rethink of fault modeling.
Silent data corruption (SDC) threatens the reliability of large-scale GPU clusters used for training large language models, yet its rarity and lack of explicit error signals make accurate high-level modeling challenging. To address this gap, we conducted a large-scale gate-level stuck-at fault injection campaign on a production-class data-center GPU, consuming over three million simulator hours across 63 CUDA micro-benchmarks. We extracted GPU SDC characteristics in terms of corruption types, bit-flip behavior, and warp-aligned spatial correlation. Our results show that NaN/+INF/-INF account for only 1.01% of SDC outcomes, that single-bit flips constitute less than 40% of bit-flip events, and that corruption addresses exhibit periodicity. These statistics motivate distribution-aware high-level fault modeling and realistic software-based fault injection for resilience evaluation of production-class GPU architectures.
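To make the three reported statistics concrete, here is a minimal Python sketch of how a corrupted 32-bit float output could be compared against its fault-free ("golden") value. It is an illustration only, not the authors' tooling: the function names `classify_corruption` and `address_strides`, the choice of IEEE 754 binary32, and the use of address differences as a periodicity check are all assumptions made for this example.

```python
from collections import Counter

EXP_MASK = 0x7F800000   # binary32 exponent field
MAN_MASK = 0x007FFFFF   # binary32 mantissa field
SIGN_MASK = 0x80000000  # binary32 sign bit


def classify_corruption(golden_bits: int, observed_bits: int) -> dict:
    """Compare two binary32 bit patterns (hypothetical helper).

    Returns the outcome kind ("nan", "+inf", "-inf", "finite", "match")
    and the number of bits that differ from the golden value.
    """
    flipped = bin((golden_bits ^ observed_bits) & 0xFFFFFFFF).count("1")
    exponent_all_ones = (observed_bits & EXP_MASK) == EXP_MASK
    if exponent_all_ones and (observed_bits & MAN_MASK):
        kind = "nan"
    elif exponent_all_ones:
        kind = "-inf" if observed_bits & SIGN_MASK else "+inf"
    elif flipped == 0:
        kind = "match"
    else:
        kind = "finite"
    return {"kind": kind, "flipped_bits": flipped}


def address_strides(corrupted_addresses: list[int]) -> Counter:
    """Count gaps between consecutive corrupted addresses (hypothetical helper).

    A stride that dominates the counts is a simple hint of periodicity.
    """
    ordered = sorted(set(corrupted_addresses))
    return Counter(b - a for a, b in zip(ordered, ordered[1:]))


# Example values chosen for illustration; they are not data from the study.
ONE = 0x3F800000  # bit pattern of 1.0f
single_flip = classify_corruption(ONE, 0x3F800001)
to_inf = classify_corruption(ONE, 0x7F800000)
to_nan = classify_corruption(ONE, 0x7FC00000)
strides = address_strides([0, 128, 256, 260])
```

Working on raw bit patterns rather than Python floats keeps the comparison exact, since Python floats are 64-bit and would hide the 32-bit layout. Aggregating `kind` and `flipped_bits` over many injected faults would yield the sort of distributions the abstract describes, although the study's own extraction pipeline may differ.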