Polytechnique · UCSD · Jun 8, 2026 · arXiv:2606.08891

PALUTE: Processing-In-Memory Acceleration via Lookup Table for Edge LLM Inference

Runyang Tian, Yanru Chen, Weihong Xu, Tajana Šimunić Rosing

AI Summary

This paper introduces PALUTE, a Processing-In-Memory accelerator for edge inference of large language models (LLMs) that uses Lookup Tables (LUTs) to reduce the overhead of dequantization and nonlinear operations. By exploiting the vertical organization of Monolithic 3D (M3D) DRAM array tiles, PALUTE performs in-DRAM LUT queries with high parallelism and low area overhead. In cycle-accurate simulation and RTL synthesis, PALUTE delivers 1,264 TPS end-to-end throughput at 0.16 W under W4A4 across Qwen3-4B models, a 12.8× improvement in energy efficiency over CHIME.

Key Contribution

PALUTE achieves 1,264 TPS at 0.16 W, improving energy efficiency by 12.8× over CHIME and 1.6× over FIGLUT, and area efficiency by 2.0× over PIMPAL.

Abstract

Large language models are increasingly deployed on edge devices with tight power and area budgets. While mixed-precision GEMM reduces arithmetic complexity, quantized inference is often dominated by dequantization and nonlinear operators. Lookup Table (LUT)-based methods mitigate these costs by precomputing outputs and replacing repeated arithmetic with table lookups, but existing designs incur significant capacity and lookup-latency overheads. This paper presents PALUTE, a LUT-based Processing-In-Memory accelerator built on Monolithic 3D (M3D) DRAM for efficient edge LLM inference. PALUTE enables in-DRAM LUT queries that exploit the vertical organization of M3D DRAM memory array tiles to achieve high parallelism with low area overhead. A near-memory LUT generator supports low-latency LUT generation for both GEMM and element-wise unary nonlinear operators, while a system-level tiering and scheduling strategy minimizes data movement across memory tiers. Evaluation using cycle-accurate simulation and RTL synthesis shows that PALUTE achieves 1,264 TPS end-to-end throughput at 0.16 W, improving energy efficiency by 12.8× over CHIME and 1.6× over FIGLUT, and improving area efficiency by 2.0× over PIMPAL, under W4A4 across Qwen3-4B models.
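The general LUT idea described in the abstract (precompute outputs once, then replace repeated arithmetic with table lookups) can be illustrated in software. The following is a minimal sketch, not PALUTE's actual table layout or hardware mechanism, which the abstract does not specify. It assumes unsigned 4-bit codes with an affine dequantization (zero point 8), and all names (`dequant`, `build_product_lut`, `build_unary_lut`, `lut_dot`) as well as the choice of SiLU as the example nonlinear operator are hypothetical.

```python
import math

BITS = 4
LEVELS = 1 << BITS  # 16 possible codes for a 4-bit value (W4A4)


def dequant(code, scale, zero_point=8):
    """Map an unsigned 4-bit code to a real value (assumed affine scheme)."""
    return (code - zero_point) * scale


def build_product_lut(w_scale, a_scale):
    """Precompute all 16 x 16 dequantized weight * activation products."""
    return [
        [dequant(w, w_scale) * dequant(a, a_scale) for a in range(LEVELS)]
        for w in range(LEVELS)
    ]


def build_unary_lut(fn, a_scale):
    """Precompute a unary nonlinear operator for each of the 16 activation codes."""
    return [fn(dequant(a, a_scale)) for a in range(LEVELS)]


def lut_dot(w_codes, a_codes, lut):
    """Dot product that uses only table lookups and additions at query time."""
    return sum(lut[w][a] for w, a in zip(w_codes, a_codes))


def silu(x):
    """Example nonlinear operator (an illustrative choice, not from the paper)."""
    return x / (1.0 + math.exp(-x))


# Illustrative 4-bit codes and scales (made-up values).
w_codes = [3, 15, 8, 0]
a_codes = [9, 7, 12, 8]

prod_lut = build_product_lut(w_scale=0.1, a_scale=0.5)
lut_result = lut_dot(w_codes, a_codes, prod_lut)

# Reference path: dequantize and multiply explicitly for every element.
reference = sum(
    dequant(w, 0.1) * dequant(a, 0.5) for w, a in zip(w_codes, a_codes)
)

# Unary operator applied by lookup instead of evaluating exp() per element.
silu_lut = build_unary_lut(silu, a_scale=0.5)
activated = [silu_lut[a] for a in a_codes]
```

In this sketch the product table has only 16 × 16 entries and the unary table only 16, so table size is trivial. The capacity and lookup-latency overheads the abstract mentions arise in real designs where tables are generated per tile or per scale and queried at very high rates.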

Architecture Design (Transformers, SSMs, MoE) · Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
