Search papers, labs, and topics across Lattice.
The paper introduces SHIELD, a segmented eDRAM architecture tailored for energy-efficient LLM inference on edge NPUs. SHIELD exploits the varying temporal residency and bit-level sensitivity of BF16 activations by segmenting the eDRAM and applying differentiated refresh rates based on activation lifecycle and data significance. Results show a 35% reduction in eDRAM refresh energy without compromising accuracy across several LLMs and benchmarks.
You can slash LLM inference energy by 35% on edge devices just by intelligently managing eDRAM refresh rates based on activation data type and lifespan.
Large Language Model (LLM) inference on edge Neural Processing Units (NPUs) is fundamentally constrained by limited on-chip memory capacity. Although high-density embedded DRAM (eDRAM) is attractive for storing activation workspaces, its periodic refresh consumes substantial energy. Prior work has primarily focused on reducing off-chip traffic or optimizing refresh for persistent Key-Value (KV) caches, while transient and error-resilient Query and Attention Output (QO) activations are largely overlooked. We propose SHIELD, a lifecycle-aware segmented eDRAM architecture that jointly exploits temporal residency and bit-level sensitivity in bfloat16 (BF16) activations. SHIELD isolates the sign and exponent fields from the mantissa, disables refresh for transient QO mantissas, and applies relaxed refresh to persistent KV mantissas. Across multiple LLMs and inference scenarios, SHIELD reduces eDRAM refresh energy by 35% relative to a standard-refresh baseline while preserving accuracy on WikiText-2, PIQA, and ARC-Easy.