RangeGuard is a metadata-centric error-correcting framework for DNNs. It targets memory errors, to which LLMs are especially vulnerable because attention, residual, and normalization layers can amplify a single corrupted activation and carry it across many layers. Rather than protecting raw bits, it encodes compact Range Identifiers (RIDs) that capture the numerical range of each value, so detection and correction concentrate on range changes, which signal harmful semantic deviations. The method tolerates 64+ flipped bits using only 16 bits of parity, keeps error magnitudes bounded within the range, and supports reliable DNN execution under frequent memory errors.
RangeGuard lets you tolerate 64+ flipped bits in DNN memory using just 16 bits of parity, without a noticeable accuracy loss.
As DRAM scales in density and adopts 3D integration, raw fault rates increase and multi-bit errors are no longer rare. Such errors can severely impact Deep Neural Networks (DNNs): although DNNs tolerate small numerical perturbations, random bit flips can create extreme outliers that propagate and sharply degrade accuracy. Large Language Models (LLMs) are particularly vulnerable because attention, residual, and normalization layers can amplify and preserve a single corrupted activation across many layers, destabilizing inference. This paper introduces RangeGuard, a metadata-centric error-correcting framework that provides strong reliability and high efficiency based on bounded approximate correction. Instead of protecting raw bits, RangeGuard encodes compact Range Identifiers (RIDs) that capture the numerical range of each value. These compact metadata enable efficient use of limited redundancy and concentrate protection on range changes, which indicate harmful semantic deviations, while ignoring benign intra-range variations. Upon detecting a range change, RangeGuard restores the correct range and substitutes a representative value, ensuring that error magnitudes are bounded within the range. Based on RIDs, RangeGuard can tolerate 64+ flipped bits using only 16 bits of parity available in GPU memories without a noticeable accuracy loss. By introducing semantic range protection, RangeGuard enables reliable DNN execution even under frequent memory errors and tight redundancy budgets.
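The idea of bounded approximate correction can be illustrated with a minimal Python sketch. It is not the paper's encoding, which the abstract does not specify. As assumptions for illustration only, the sketch treats the sign bit plus the 5-bit exponent field of an FP16 value as a hypothetical 6-bit RID, assumes the correct RID has already been recovered from the protected metadata, and uses the mid-mantissa value of the range as the representative. The helper names (`rid`, `representative`, `correct`) are invented here, and the Inf/NaN exponent is not handled.

```python
import struct

def f16_bits(x: float) -> int:
    """Return the 16-bit pattern of x stored as IEEE 754 half precision."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def bits_f16(b: int) -> float:
    """Interpret a 16-bit pattern as an IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<H", b))[0]

def rid(bits: int) -> int:
    """Hypothetical RID: sign bit plus 5-bit exponent field (top 6 bits)."""
    return bits >> 10

def representative(r: int) -> float:
    """Assumed representative: same sign/exponent, mantissa set to its midpoint."""
    return bits_f16((r << 10) | 0x200)

def correct(stored_bits: int, trusted_rid: int) -> float:
    """Keep the stored value if its range matches; otherwise substitute."""
    if rid(stored_bits) == trusted_rid:
        return bits_f16(stored_bits)      # intra-range variation is ignored
    return representative(trusted_rid)    # error is bounded by the range width

clean = f16_bits(0.5)                     # lies in the range [0.5, 1.0)
trusted = rid(clean)                      # assumed recovered via parity

exponent_flip = clean ^ 0x4000            # flip the top exponent bit
mantissa_flip = clean ^ 0x0001            # flip the lowest mantissa bit

print(bits_f16(exponent_flip))            # extreme outlier: 32768.0
print(correct(exponent_flip, trusted))    # restored into range: 0.75
print(correct(mantissa_flip, trusted))    # benign change passes through
```

In this sketch the exponent flip turns 0.5 into 32768.0, while the substituted value 0.75 differs from the original by 0.25, which is less than the width of the range [0.5, 1.0). The mantissa flip stays inside the range and is deliberately left uncorrected.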