DUT, Lancaster University, NJIT, SEU | Apr 12, 2026 | arXiv:2604.10484

Strix: Re-thinking NPU Reliability from a System Perspective

Jiapeng Guan, Jie Zhang, Hao Zhou, Ran Wei, Dean You, Yingquan Wang, Tinglue Wang, Xudong Zhao, Zhe Jiang

AI Summary

Strix is a full-stack NPU reliability framework that addresses the growing frequency of hardware faults in DNN/LLM accelerators. It re-partitions the NPU along the system inference pipeline, identifies the dominant failure modes, and attaches targeted safeguards to each. On an open-source SoC, the framework achieves sub-microsecond fault localization, error detection, and correction with a 1.04× slowdown and minimal hardware overhead.

Key Contribution

Fine-grained partitioning and targeted safeguards can provide robust NPU reliability with minimal performance overhead, challenging the assumption that coarse-grained replication of the whole NPU is the only path forward.

Abstract

DNNs and LLMs increasingly rely on hardware accelerators, including in safety-critical domains, while technology scaling and growing model complexity make hardware faults more frequent. Existing system-level mechanisms typically treat the NPU as a monolithic unit, using coarse-grained replication that incurs prohibitive performance and hardware overheads, leaving a gap between reliability requirements and deployable solutions. To bridge this gap, we present Strix, a full-stack NPU reliability framework on an open-source SoC, spanning micro-architecture, ISA, and programming methods. Strix re-partitions the NPU along the system inference pipeline, identifies dominant failure modes, and attaches targeted safeguards, achieving sub-microsecond fault localisation, error detection, and correction with only 1.04× slowdown and minimal hardware overhead.

Architecture Design (Transformers, SSMs, MoE) | Distributed Systems & Hardware | Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
