Georgia Tech | Polytechnique | UCLA | UCSD | Mar 1, 2026 | arXiv:2603.01175

HAVEN: High-Bandwidth Flash Augmented Vector Engine for Large-Scale Approximate Nearest-Neighbor Search Acceleration

Po-Kai Hsu, Weihong Xu, Qunyou Liu, Tajana Rosing, Shimeng Yu

AI Summary

The paper introduces HAVEN, a GPU architecture augmented with High-Bandwidth Flash (HBF) to accelerate large-scale approximate nearest neighbor search (ANNS) for Retrieval-Augmented Generation (RAG). HAVEN integrates HBF and a near-storage search unit on-package so that the full-precision vector database resides entirely on-device, which eliminates the PCIe and DDR bottlenecks of the IVF-PQ reranking stage. Compared to GPU-DRAM and GPU-SSD systems on billion-scale datasets, HAVEN improves reranking throughput by up to 20x and reduces reranking latency by up to 40x.
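To make the data-movement argument concrete, the following is a back-of-envelope sketch of how long it takes to stream full-precision candidate vectors over a link of a given bandwidth. The function name, the workload sizes, and both bandwidth figures are illustrative assumptions, not values from the paper, and the model ignores access latency, read granularity, and power limits.

```python
def rerank_fetch_time_s(queries, candidates_per_query, dim, bytes_per_value, bandwidth_gb_s):
    """Seconds needed to stream all candidate vectors at a sustained bandwidth.

    Hypothetical helper: a pure bytes / bandwidth estimate, nothing more.
    """
    total_bytes = queries * candidates_per_query * dim * bytes_per_value
    return total_bytes / (bandwidth_gb_s * 1e9)


# Assumed workload: 1000 queries, 1000 rerank candidates each,
# 128-dimensional float32 vectors (4 bytes per value).
workload = dict(queries=1000, candidates_per_query=1000, dim=128, bytes_per_value=4)

# Assumed bandwidths: an off-package link versus an on-package one.
off_package_s = rerank_fetch_time_s(bandwidth_gb_s=64, **workload)
on_package_s = rerank_fetch_time_s(bandwidth_gb_s=512, **workload)

print(f"off-package: {off_package_s * 1e3:.1f} ms, on-package: {on_package_s * 1e3:.1f} ms")
```

Under these assumed numbers the fetch time scales inversely with bandwidth, which is the intuition behind moving the full-precision database on-package. The speedups the paper reports come from its own detailed modeling, not from an estimate like this one.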

Key Contribution

HAVEN keeps the full-precision vector database on-package in high-bandwidth flash, avoiding PCIe transfers during reranking and delivering up to 20x higher reranking throughput and up to 40x lower reranking latency for RAG retrieval.

Abstract

Retrieval-Augmented Generation (RAG) relies on large-scale Approximate Nearest Neighbor Search (ANNS) to retrieve semantically relevant context for large language models. Among ANNS methods, IVF-PQ offers an attractive balance between memory efficiency and search accuracy. However, achieving high recall requires a reranking stage that fetches full-precision vectors, and billion-scale vector databases must reside in CPU DRAM or on SSD because GPU HBM capacity is limited. This off-GPU data movement introduces substantial latency and throughput degradation. We propose HAVEN, a GPU architecture augmented with High-Bandwidth Flash (HBF), a recently introduced die-stacked 3D NAND technology engineered to deliver terabyte-scale capacity and hundreds of GB/s of read bandwidth. By integrating HBF and a near-storage search unit as an on-package complement to HBM, HAVEN enables the full-precision vector database to reside entirely on-device, eliminating PCIe and DDR bottlenecks during reranking. Through detailed modeling of re-architected 3D NAND subarrays, power-constrained HBF bandwidth, and end-to-end IVF-PQ pipelines, we demonstrate that HAVEN improves reranking throughput by up to 20x and reduces reranking latency by up to 40x across billion-scale datasets compared to GPU-DRAM and GPU-SSD systems. Our results show that HBF-augmented GPUs enable high-recall retrieval at throughput previously achievable only without reranking, offering a promising direction for memory-centric AI accelerators.
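The retrieve-then-rerank pattern the abstract refers to can be sketched as follows. This is a minimal illustration, not the paper's pipeline: a crude scalar quantizer stands in for real PQ codes, there is no inverted-file (IVF) partitioning, and all function names and sizes are hypothetical. The point is that stage 2 reads full-precision vectors only for the shortlisted candidates, which is the fetch that crosses PCIe or DDR when the database lives off-GPU.

```python
import heapq
import random


def l2_sq(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vec, step=0.5):
    """Lossy scalar quantizer used as a stand-in for PQ codes (not real PQ)."""
    return [round(x / step) * step for x in vec]


def search_with_rerank(query, compressed, full_precision, k, n_candidates):
    """Two-stage search: approximate shortlist, then exact rerank."""
    # Stage 1: shortlist by distance to the compressed vectors.
    candidates = heapq.nsmallest(
        n_candidates,
        range(len(compressed)),
        key=lambda i: l2_sq(query, compressed[i]),
    )
    # Stage 2: fetch full-precision vectors for the shortlist only
    # and re-score them exactly.
    return heapq.nsmallest(
        k, candidates, key=lambda i: l2_sq(query, full_precision[i])
    )


def exact_top_k(query, full_precision, k):
    """Brute-force ground truth over the full-precision vectors."""
    return heapq.nsmallest(
        k,
        range(len(full_precision)),
        key=lambda i: l2_sq(query, full_precision[i]),
    )


random.seed(0)
database = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(500)]
codes = [quantize(v) for v in database]
query = [random.uniform(-1, 1) for _ in range(8)]

approx = search_with_rerank(query, codes, database, k=5, n_candidates=50)
exact = exact_top_k(query, database, k=5)
recall = len(set(approx) & set(exact)) / len(exact)
print(f"recall@5 with 50 rerank candidates: {recall:.2f}")
```

Raising `n_candidates` generally raises recall but also increases the number of full-precision vectors that must be read per query, which is why reranking throughput becomes bound by the bandwidth of wherever those vectors are stored.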

Distributed Systems & Hardware | Inference & Quantization | Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
