This paper introduces PALUTE, a Processing-In-Memory accelerator for efficient edge inference of large language models (LLMs). It uses Lookup Tables (LUTs) to reduce the computational overhead of dequantization and nonlinear operations. By exploiting the vertical organization of Monolithic 3D DRAM, PALUTE performs LUT queries inside DRAM with high parallelism and low area overhead, improving throughput and energy efficiency. The evaluation reports 1,264 TPS end-to-end throughput at 0.16 W, a 12.8× improvement in energy efficiency over CHIME.
PALUTE achieves 1,264 TPS at only 0.16 W for edge LLM inference, with 12.8× better energy efficiency than CHIME.
Large language models are increasingly deployed on edge devices with tight power and area budgets. While mixed-precision GEMM reduces arithmetic complexity, quantized inference is often dominated by dequantization and nonlinear operators. Lookup Table (LUT)-based methods mitigate these costs by precomputing outputs and replacing repeated arithmetic with table lookups, but existing designs incur significant capacity and lookup-latency overheads. This paper presents PALUTE, a LUT-based Processing-In-Memory accelerator built on Monolithic 3D (M3D) DRAM for efficient edge LLM inference. PALUTE enables in-DRAM LUT queries that exploit the vertical organization of M3D DRAM memory array tiles to achieve high parallelism with low area overhead. A near-memory LUT generator supports low-latency LUT generation for both GEMM and element-wise unary nonlinear operators, while a system-level tiering and scheduling strategy minimizes data movement across memory tiers. Evaluation using cycle-accurate simulation and RTL synthesis shows that PALUTE achieves 1,264 TPS end-to-end throughput at 0.16 W, improving energy efficiency by 12.8$\times$ over CHIME and 1.6$\times$ over FIGLUT, and improving area efficiency by 2.0$\times$ over PIMPAL, under W4A4 across Qwen3-4B models.
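The general idea of replacing dequantization and a unary nonlinear operator with one table lookup can be illustrated with a minimal software sketch. It is not PALUTE's hardware design: the function names, the choice of SiLU as the nonlinearity, and the `scale` and `zero_point` values are all assumptions made for illustration. For 4-bit activation codes there are only 16 possible inputs, so each output can be computed once and reused.

```python
import math

def silu(x):
    """SiLU nonlinearity, chosen here only as an example unary operator."""
    return x / (1.0 + math.exp(-x))

def build_unary_lut(fn, scale, zero_point, bits=4):
    """Precompute fn(dequantize(code)) for every possible quantized code.

    Dequantization is assumed to be affine: x = scale * (code - zero_point).
    The table therefore folds dequantization and the nonlinearity together.
    """
    return [fn(scale * (code - zero_point)) for code in range(1 << bits)]

def apply_lut(codes, lut):
    """Replace per-element arithmetic with one table lookup per element."""
    return [lut[c] for c in codes]

# Hypothetical quantization parameters for a 4-bit activation tensor.
scale, zero_point = 0.5, 8
lut = build_unary_lut(silu, scale, zero_point)

codes = [0, 8, 15, 8]          # quantized activations
out = apply_lut(codes, lut)    # no exp() or multiply at lookup time
```

The table is built once per set of quantization parameters and then reused for every element, which is why both table capacity and lookup latency matter for designs of this kind.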