NaNs and single-bit flips are commonly assumed to be the main culprits in GPU silent data corruption; this study finds they are surprisingly rare, which calls for a rethink of fault modeling.
Even moderate GPU fault rates can catastrophically derail LLM training, depending on the specific hardware datapath and numerical precision format.