Apr 28, 2026 | arXiv:2604.26074

DAK: Direct-Access-Enabled GPU Memory Offloading with Optimal Efficiency for LLM Inference

AI Summary

This paper introduces DAK, a direct-access memory offloading framework for LLM inference that lets the GPU read offloaded data from remote memory directly instead of prefetching it into local HBM. DAK repurposes the Tensor Memory Accelerator (TMA) to asynchronously fetch offloaded weights and KV caches from remote memory into GPU shared memory. A greedy algorithm selects per-operation offloading ratios, and active congestion control together with TMA multicast addresses interconnect bottlenecks and read amplification. DAK achieves near-optimal bandwidth aggregation, with up to 3× performance gains on NVLink-C2C and 1.8× on PCIe systems over state-of-the-art memory offloading baselines.

Key Contribution

By giving the GPU direct access to remote memory instead of prefetching into HBM, DAK achieves near-optimal aggregate bandwidth and up to 3× performance gains on NVLink-C2C (1.8× on PCIe) over state-of-the-art memory offloading baselines.

Abstract

LLM inference is constrained by GPU memory capacity and bandwidth. Tiered memory architectures mitigate this by allowing the GPU to offload memory to the remote tier. However, existing memory offloading frameworks rely on prefetching data into local GPU HBM. This approach underutilizes system resources by introducing HBM contention, squandering memory capacity, and creating pipeline bubbles. We show that enabling direct GPU access to remote memory significantly outperforms prefetching, achieving optimal aggregate system bandwidth. We propose DAK, an end-to-end direct-access memory offloading framework that repurposes the Tensor Memory Accelerator (TMA) to asynchronously fetch offloaded weights and KV caches directly from remote memory into GPU shared memory (SMEM). To maximize remote access performance, DAK introduces a greedy algorithm to determine optimal per-operation offloading ratios, alongside active congestion control and TMA multicast to eliminate interconnect bottlenecks and read amplification. Evaluations across diverse architectures show that DAK achieves near-optimal bandwidth aggregation, with up to 3× performance gains on NVLink-C2C and 1.8× on PCIe systems compared to state-of-the-art memory offloading baselines.
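To make "aggregate system bandwidth" concrete, here is a minimal sketch of an idealized two-tier read model. It is not DAK's greedy algorithm, whose details the abstract does not give. It assumes a memory-bound operation that reads its data from local HBM and from remote memory fully in parallel, with no contention or overlap loss. The function names and the bandwidth figures are hypothetical placeholders chosen for easy arithmetic, not measurements from the paper.

```python
def read_time(total_gb, remote_fraction, local_bw, remote_bw):
    """Time (s) to read total_gb when a fraction is served from remote memory.

    Assumption: both tiers stream in parallel, so the slower stream
    determines the finish time. Bandwidths are in GB/s.
    """
    local_time = (1.0 - remote_fraction) * total_gb / local_bw
    remote_time = remote_fraction * total_gb / remote_bw
    return max(local_time, remote_time)


def balanced_remote_fraction(local_bw, remote_bw):
    """Remote fraction at which both streams finish at the same time.

    Under the parallel-read assumption this minimizes read_time and
    yields an effective bandwidth of local_bw + remote_bw.
    """
    return remote_bw / (local_bw + remote_bw)


def effective_bandwidth(total_gb, remote_fraction, local_bw, remote_bw):
    """Achieved bandwidth (GB/s) for a given remote fraction."""
    return total_gb / read_time(total_gb, remote_fraction, local_bw, remote_bw)


if __name__ == "__main__":
    # Hypothetical bandwidths for illustration only.
    LOCAL_BW, REMOTE_BW, DATA_GB = 3000.0, 1000.0, 100.0
    best = balanced_remote_fraction(LOCAL_BW, REMOTE_BW)
    for fraction in (0.0, best, 0.5):
        bw = effective_bandwidth(DATA_GB, fraction, LOCAL_BW, REMOTE_BW)
        print(f"remote fraction {fraction:.2f}: {bw:.0f} GB/s effective")
```

In this model, offloading too little leaves the remote link idle and offloading too much makes the remote link the bottleneck, which is one way to see why the choice of per-operation offloading ratio matters.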

Architecture Design (Transformers, SSMs, MoE) | Distributed Systems & Hardware | Inference & Quantization

Citation Metrics

Citations: 0

Influential citations: 0

References: 35

Year: 2026

Venue: N/A
