Johannes Gutenberg University Mainz | Mar 16, 2026 | arXiv:2603.15486

Cuckoo-GPU: Accelerating Cuckoo Filters on Modern GPUs

Tim Dortmann, Markus Vieth, Bertil Schmidt

AI Summary

Cuckoo-GPU, a high-performance Cuckoo filter library for GPUs, was developed to address the performance limitations of existing GPU-based dynamic Approximate Membership Query (AMQ) structures. The library uses a lock-free architecture with atomic compare-and-swap operations and a breadth-first search-based eviction heuristic to maximize global memory bandwidth utilization. Benchmarks on NVIDIA GH200 and RTX PRO 6000 GPUs demonstrate that Cuckoo-GPU significantly outperforms existing dynamic AMQ structures, achieving up to 378x higher insertion throughput than GQF and rivaling the query throughput of append-only Bloom filters.
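The lock-free, compare-and-swap slot update mentioned above can be sketched on the host in C++. This is only an illustration under stated assumptions, not the library's code: `std::atomic` stands in for a device atomic such as CUDA's `atomicCAS`, and the class name `CasCuckooSketch`, the 16-bit fingerprints, the four slots per bucket, and the mixing function are all hypothetical choices. It follows the standard partial-key cuckoo hashing scheme (alternate bucket = bucket XOR hash of the fingerprint) and omits eviction entirely, so `insert` simply fails when both candidate buckets are full.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side sketch; not the Cuckoo-GPU API.
class CasCuckooSketch {
public:
    static constexpr std::size_t kSlotsPerBucket = 4;  // assumed bucket width
    static constexpr std::uint16_t kEmpty = 0;          // reserved "free slot" value

    // num_buckets must be a power of two so the XOR mapping stays in range.
    explicit CasCuckooSketch(std::size_t num_buckets)
        : mask_(num_buckets - 1), slots_(num_buckets * kSlotsPerBucket) {
        for (auto& s : slots_) s.store(kEmpty, std::memory_order_relaxed);
    }

    // Claim the first free slot in either candidate bucket with one CAS.
    bool insert(std::uint64_t key) {
        const std::uint16_t fp = fingerprint(key);
        const std::size_t b1 = mix(key) & mask_;
        return try_swap(b1, kEmpty, fp) || try_swap(alt_bucket(b1, fp), kEmpty, fp);
    }

    // Read-only probe of both candidate buckets.
    bool contains(std::uint64_t key) const {
        const std::uint16_t fp = fingerprint(key);
        const std::size_t b1 = mix(key) & mask_;
        return holds(b1, fp) || holds(alt_bucket(b1, fp), fp);
    }

    // Deletion is the mirror image of insertion: swap the fingerprint back to empty.
    bool remove(std::uint64_t key) {
        const std::uint16_t fp = fingerprint(key);
        const std::size_t b1 = mix(key) & mask_;
        return try_swap(b1, fp, kEmpty) || try_swap(alt_bucket(b1, fp), fp, kEmpty);
    }

private:
    // 64-bit mixer with SplitMix64-style constants (an arbitrary choice).
    static std::uint64_t mix(std::uint64_t x) {
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }
    // Top 16 bits of the hash, remapped so a fingerprint never equals kEmpty.
    static std::uint16_t fingerprint(std::uint64_t key) {
        const auto fp = static_cast<std::uint16_t>(mix(key) >> 48);
        return fp == kEmpty ? std::uint16_t{1} : fp;
    }
    // Partial-key cuckoo hashing: applying this twice returns the original bucket.
    std::size_t alt_bucket(std::size_t b, std::uint16_t fp) const {
        return (b ^ mix(fp)) & mask_;
    }
    // Try to change one slot of `bucket` from `from` to `to` atomically.
    bool try_swap(std::size_t bucket, std::uint16_t from, std::uint16_t to) {
        for (std::size_t s = 0; s < kSlotsPerBucket; ++s) {
            std::uint16_t expected = from;
            if (slots_[bucket * kSlotsPerBucket + s].compare_exchange_strong(expected, to))
                return true;
        }
        return false;
    }
    bool holds(std::size_t bucket, std::uint16_t fp) const {
        for (std::size_t s = 0; s < kSlotsPerBucket; ++s)
            if (slots_[bucket * kSlotsPerBucket + s].load() == fp) return true;
        return false;
    }

    std::size_t mask_;
    std::vector<std::atomic<std::uint16_t>> slots_;
};
```

Because every slot changes through a single compare-and-swap, concurrent inserts and deletes in this sketch never take a lock. A complete filter would evict an existing fingerprint rather than return `false` when both buckets are full.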

Key Contribution

Cuckoo filters on GPUs can now rival Bloom filters in query speed, finally making dynamic approximate membership queries practical for high-throughput systems.

Abstract

Approximate Membership Query (AMQ) structures are essential for high-throughput systems in databases, networking, and bioinformatics. While Bloom filters offer speed, they lack support for deletions. Existing GPU-based dynamic alternatives, such as the Two-Choice Filter (TCF) and GPU Quotient Filter (GQF), enable deletions but incur severe performance penalties. We present Cuckoo-GPU, an open-source, high-performance Cuckoo filter library for GPUs. Instead of prioritizing cache locality, Cuckoo-GPU embraces the inherently random access pattern of Cuckoo hashing to fully saturate global memory bandwidth. Our design features a lock-free architecture built on atomic compare-and-swap operations, paired with a novel breadth-first search-based eviction heuristic that minimizes thread divergence and bounds sequential memory accesses during high-load insertions. Evaluated on NVIDIA GH200 (HBM3) and RTX PRO 6000 Blackwell (GDDR7) systems, Cuckoo-GPU closes the performance gap between append-only and dynamic AMQ structures. It achieves insertion, query, and deletion throughputs up to 378x (4.1x), 6x (34.7x), and 258x (107x) higher than GQF (TCF) on the same hardware, respectively, and delivers up to a 350x speedup over the fastest available multi-threaded CPU-based Cuckoo filter implementation. Moreover, its query throughput rivals that of the append-only GPU-based Blocked Bloom filter, demonstrating that dynamic AMQ structures can be deployed on modern accelerators without sacrificing performance.
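To give a feel for what a breadth-first eviction search with a bounded depth looks like, here is a generic, single-threaded C++ sketch. The abstract does not spell out the paper's heuristic, so this is not it: the struct name `BfsEvictionSketch`, the `max_depth` parameter, the 16-bit fingerprints, the four slots per bucket, and the mixing function are all assumptions made for illustration. The search explores candidate buckets level by level, stops at the first bucket with a free slot, and then shifts each fingerprint on that path one hop toward the free slot.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical single-threaded sketch; the paper's GPU heuristic may differ.
struct BfsEvictionSketch {
    static constexpr std::size_t kSlots = 4;    // assumed slots per bucket
    static constexpr std::uint16_t kEmpty = 0;  // reserved "free slot" value

    // One search-tree node: a bucket plus the (parent, slot) edge leading to it.
    struct Node { std::size_t bucket; int parent; std::size_t parent_slot; int depth; };

    std::size_t mask;
    std::vector<std::uint16_t> table;  // zero-filled, so every slot starts as kEmpty

    // num_buckets must be a power of two so the XOR mapping stays in range.
    explicit BfsEvictionSketch(std::size_t num_buckets)
        : mask(num_buckets - 1), table(num_buckets * kSlots) {}

    static std::uint64_t mix(std::uint64_t x) {  // SplitMix64-style mixer (arbitrary)
        x ^= x >> 30; x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27; x *= 0x94d049bb133111ebULL;
        return x ^ (x >> 31);
    }
    static std::uint16_t fingerprint(std::uint64_t key) {
        const auto fp = static_cast<std::uint16_t>(mix(key) >> 48);
        return fp == kEmpty ? std::uint16_t{1} : fp;
    }
    std::size_t alt(std::size_t b, std::uint16_t fp) const { return (b ^ mix(fp)) & mask; }

    static bool visited(const std::vector<Node>& nodes, std::size_t bucket) {
        for (const Node& n : nodes) if (n.bucket == bucket) return true;
        return false;
    }
    bool holds(std::size_t b, std::uint16_t fp) const {
        for (std::size_t s = 0; s < kSlots; ++s)
            if (table[b * kSlots + s] == fp) return true;
        return false;
    }
    bool contains(std::uint64_t key) const {
        const std::uint16_t fp = fingerprint(key);
        const std::size_t b1 = mix(key) & mask;
        return holds(b1, fp) || holds(alt(b1, fp), fp);
    }

    // Breadth-first search for an eviction path of at most max_depth moves.
    bool insert(std::uint64_t key, int max_depth = 3) {
        const std::uint16_t fp = fingerprint(key);
        const std::size_t b1 = mix(key) & mask;
        std::vector<Node> nodes;  // doubles as the FIFO queue
        nodes.push_back({b1, -1, 0, 0});
        if (alt(b1, fp) != b1) nodes.push_back({alt(b1, fp), -1, 0, 0});

        for (std::size_t i = 0; i < nodes.size(); ++i) {
            const Node n = nodes[i];  // copy: push_back below may reallocate
            for (std::size_t s = 0; s < kSlots; ++s) {
                if (table[n.bucket * kSlots + s] != kEmpty) continue;
                // Free slot found: walk back to the root, shifting one hop each step.
                std::size_t dst = n.bucket * kSlots + s;
                for (int cur = static_cast<int>(i); nodes[cur].parent >= 0;
                     cur = nodes[cur].parent) {
                    const std::size_t src =
                        nodes[nodes[cur].parent].bucket * kSlots + nodes[cur].parent_slot;
                    table[dst] = table[src];
                    dst = src;
                }
                table[dst] = fp;  // the vacated slot in a root bucket takes the new item
                return true;
            }
            if (n.depth == max_depth) continue;  // bound the length of any move chain
            for (std::size_t s = 0; s < kSlots; ++s) {
                const std::size_t next = alt(n.bucket, table[n.bucket * kSlots + s]);
                if (!visited(nodes, next))
                    nodes.push_back({next, static_cast<int>(i), s, n.depth + 1});
            }
        }
        return false;  // no free slot reachable within max_depth evictions
    }
};
```

Compared with a random-walk eviction, a breadth-first search finds a shortest move chain and caps its length up front, which is one plausible reading of "bounds sequential memory accesses". A concurrent GPU version would also have to make each move atomic, which this sketch does not attempt.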

Architecture Design (Transformers, SSMs, MoE) | Distributed Systems & Hardware | Inference & Quantization

Year: 2026

Venue: N/A
