This paper analyzes the memory processing pipeline in LLM inference, identifying significant overheads (22%-97%) associated with optimizations like sparse attention and RAG. It proposes and demonstrates that heterogeneous GPU-FPGA systems can accelerate this pipeline by offloading sparse, irregular, and memory-bound operations to FPGAs. Experiments on AMD MI210 GPU and Alveo U55C FPGA show 1.04-2.2x speedup and 1.11-4.7x energy reduction compared to a GPU-only baseline.
LLM inference bottlenecks aren't just compute-bound: heterogeneous GPU-FPGA systems can cut memory processing overheads by up to 2.2x while simultaneously reducing energy consumption.
Modern large language models (LLMs) increasingly depend on efficient long-context processing and generation mechanisms, including sparse attention, retrieval-augmented generation (RAG), and compressed contextual memory, to support complex reasoning. We show that these optimizations can be unified into a four-step memory processing pipeline: Prepare Memory, Compute Relevancy, Retrieval, and Apply to Inference. Through systematic profiling, we identify a 22%-97% memory processing overhead in LLM inference and strong heterogeneity in its computational characteristics. Motivated by this insight, we argue that \textbf{heterogeneous systems} are well-suited to accelerate memory processing and thus end-to-end inference. We demonstrate this approach on a GPU-FPGA system by offloading sparse, irregular, and memory-bound operations to FPGAs while retaining compute-intensive operations on GPUs. Evaluated on an AMD MI210 GPU and an Alveo U55C FPGA, our system is $1.04\sim2.2\times$ faster and requires $1.11\sim4.7\times$ less energy across multiple LLM inference optimizations than the GPU baseline (similar results hold on an NVIDIA A100). These results establish heterogeneous systems as a practical direction for efficient LLM memory processing and inform future heterogeneous hardware design.