NVLLM is a hardware architecture for efficient on-device LLM inference that tightly integrates 3D NAND flash memory with compute pipelines and offloads feed-forward network (FFN) computation into the flash. The design gives the compute pipelines direct page-level access to FFN weights in flash, so those weights never pass through DRAM. Evaluated on OPT and LLaMA models with up to 30B parameters, NVLLM achieves a 16.7x to 37.9x speedup over A800-based out-of-core inference and up to 4.7x over SSD-like designs, which points to its potential for resource-constrained edge devices.
Forget GPUs: NVLLM's 3D NAND-centric design runs LLM inference up to 37.9x faster than A800-based out-of-core inference, making on-device LLMs a real possibility.
The rapid growth of LLMs demands high-throughput, memory-capacity-intensive inference on resource-constrained edge devices, where single-batch decoding remains fundamentally memory-bound. Existing out-of-core GPU-based systems and SSD-like accelerators are limited by DRAM-bound weight movement and inefficient storage access granularity. We present NVLLM, a 3D NAND-centric inference architecture that offloads feed-forward network (FFN) computation into the flash while executing attention on lightweight CMOS logic with external DRAM. Through wafer-to-wafer stacking, NVLLM tightly integrates multi-plane 3D NAND with compute pipelines, error correction code (ECC) units, and buffers, enabling page-level FFN weight access without DRAM traversal. All GEMM/GEMV operations are decomposed into dot-product primitives executed by out-of-order PE lanes, which operate directly on raw NAND reads with integrated ECC. Attention weights remain in DRAM, and a KV-cache-aware scheduler sustains throughput as the context length grows. Evaluated on OPT and LLaMA models with up to 30B parameters, NVLLM achieves a 16.7$\times$--37.9$\times$ speedup over A800-based out-of-core inference and up to 4.7$\times$ speedup over SSD-like designs, with only 2.7\% CMOS area overhead.
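To illustrate the idea of decomposing a GEMV into dot-product primitives over page-sized weight slices, the following is a minimal software sketch, not the NVLLM hardware design. The page size `PAGE_ELEMS`, the function names, and the shuffled processing order (standing in for out-of-order PE lanes) are all assumptions made for illustration. ECC and NAND read behavior are not modeled.

```python
import numpy as np

PAGE_ELEMS = 4096  # hypothetical number of weights held in one NAND page


def split_into_pages(weights, page_elems=PAGE_ELEMS):
    """Return (row, col_start, page) tuples, one per page-sized slice of a weight row."""
    rows, cols = weights.shape
    pages = []
    for r in range(rows):
        for c in range(0, cols, page_elems):
            pages.append((r, c, weights[r, c:c + page_elems]))
    return pages


def gemv_by_page_dot_products(weights, x, page_elems=PAGE_ELEMS, seed=0):
    """Compute y = W @ x as a sum of independent per-page dot products.

    Pages are visited in a random order to mimic lanes completing out of
    order. Each partial dot product is accumulated into its own output row,
    so the final result does not depend on the visiting order (up to
    floating-point rounding).
    """
    pages = split_into_pages(weights, page_elems)
    order = np.random.default_rng(seed).permutation(len(pages))
    y = np.zeros(weights.shape[0], dtype=np.float64)
    for i in order:
        r, c, page = pages[i]
        # One dot-product primitive: a page of weights times the matching slice of x.
        y[r] += np.dot(page, x[c:c + len(page)])
    return y


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    W = rng.standard_normal((8, 10000))   # toy FFN weight matrix
    x = rng.standard_normal(10000)        # single-batch activation vector
    print(np.allclose(gemv_by_page_dot_products(W, x), W @ x))
```

Because every page contributes an independent partial sum, the work can be handed to whichever lane is free, which is the property an out-of-order scheme relies on.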