Cuckoo-GPU, a high-performance Cuckoo filter library for GPUs, was developed to address the performance limitations of existing GPU-based dynamic Approximate Membership Query (AMQ) structures. The library uses a lock-free architecture with atomic compare-and-swap operations and a breadth-first search-based eviction heuristic to maximize global memory bandwidth utilization. Benchmarks on NVIDIA GH200 and RTX PRO 6000 GPUs demonstrate that Cuckoo-GPU significantly outperforms existing dynamic AMQ structures, achieving up to 378x higher insertion throughput than GQF and rivaling the query throughput of append-only Bloom filters.
Cuckoo filters on GPUs can now rival Bloom filters in query speed, finally making dynamic approximate membership queries practical for high-throughput systems.
Approximate Membership Query (AMQ) structures are essential for high-throughput systems in databases, networking, and bioinformatics. While Bloom filters offer speed, they lack support for deletions. Existing GPU-based dynamic alternatives, such as the Two-Choice Filter (TCF) and GPU Quotient Filter (GQF), enable deletions but incur severe performance penalties. We present Cuckoo-GPU, an open-source, high-performance Cuckoo filter library for GPUs. Instead of prioritizing cache locality, Cuckoo-GPU embraces the inherently random access pattern of Cuckoo hashing to fully saturate global memory bandwidth. Our design features a lock-free architecture built on atomic compare-and-swap operations, paired with a novel breadth-first search-based eviction heuristic that minimizes thread divergence and bounds sequential memory accesses during high-load insertions. Evaluated on NVIDIA GH200 (HBM3) and RTX PRO 6000 Blackwell (GDDR7) systems, Cuckoo-GPU closes the performance gap between append-only and dynamic AMQ structures. On the same hardware, it achieves insertion, query, and deletion throughputs up to 378x, 6x, and 258x higher than GQF, and up to 4.1x, 34.7x, and 107x higher than TCF, respectively. It also delivers up to a 350x speedup over the fastest available multi-threaded CPU-based Cuckoo filter implementation. Moreover, its query throughput rivals that of the append-only GPU-based Blocked Bloom filter, demonstrating that dynamic AMQ structures can be deployed on modern accelerators without sacrificing performance.
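To make the data structure concrete, the following is a minimal sequential sketch in Python of a Cuckoo filter with insertion, query, and deletion, plus a breadth-first search for an eviction path when both candidate buckets are full. It is not the Cuckoo-GPU API and does not reproduce the paper's heuristic, whose details the abstract does not give. The class name `CuckooFilter`, its parameters (`num_buckets`, `bucket_size`, `fp_bits`, `max_nodes`), and the choice of BLAKE2b hashing are assumptions made for illustration. In a lock-free GPU design, the plain slot writes here would presumably become atomic compare-and-swap operations.

```python
import hashlib
from collections import deque

EMPTY = 0  # reserved value marking a free slot


def _h(data: bytes) -> int:
    """Deterministic 64-bit hash (the built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "little")


class CuckooFilter:
    """Illustrative sequential Cuckoo filter; all names here are hypothetical."""

    def __init__(self, num_buckets=1024, bucket_size=4, fp_bits=16, max_nodes=64):
        # A power-of-two bucket count keeps the XOR-derived index in range.
        assert num_buckets & (num_buckets - 1) == 0
        self.mask = num_buckets - 1
        self.fp_bits = fp_bits
        self.max_nodes = max_nodes  # rough bound on buckets explored per insert
        self.buckets = [[EMPTY] * bucket_size for _ in range(num_buckets)]

    def _fingerprint(self, item: bytes) -> int:
        fp = _h(b"fp" + item) & ((1 << self.fp_bits) - 1)
        return fp or 1  # never collide with EMPTY

    def _alt(self, index: int, fp: int) -> int:
        # Partial-key cuckoo hashing: the other bucket depends only on (index, fp).
        return (index ^ _h(fp.to_bytes(8, "little"))) & self.mask

    def _candidates(self, item: bytes):
        fp = self._fingerprint(item)
        i1 = _h(item) & self.mask
        return fp, i1, self._alt(i1, fp)

    def _place(self, index: int, fp: int) -> bool:
        bucket = self.buckets[index]
        for slot, value in enumerate(bucket):
            if value == EMPTY:
                bucket[slot] = fp
                return True
        return False

    def insert(self, item: bytes) -> bool:
        fp, i1, i2 = self._candidates(item)
        if self._place(i1, fp) or self._place(i2, fp):
            return True
        # Both buckets are full: breadth-first search for the shortest chain
        # of moves that ends in a bucket with a free slot.
        parent = {i1: None, i2: None}  # bucket -> (previous bucket, slot moved)
        queue = deque([i1, i2])
        while queue and len(parent) <= self.max_nodes:
            b = queue.popleft()
            for slot, value in enumerate(self.buckets[b]):
                nxt = self._alt(b, value)
                if nxt in parent:
                    continue
                parent[nxt] = (b, slot)
                if EMPTY in self.buckets[nxt]:
                    # Shift fingerprints along the path, starting at the free end.
                    cur = nxt
                    while parent[cur] is not None:
                        prev, s = parent[cur]
                        self._place(cur, self.buckets[prev][s])
                        self.buckets[prev][s] = EMPTY
                        cur = prev
                    return self._place(cur, fp)
                queue.append(nxt)
        return False  # no eviction path found within the search bound

    def contains(self, item: bytes) -> bool:
        fp, i1, i2 = self._candidates(item)
        return fp in self.buckets[i1] or fp in self.buckets[i2]

    def delete(self, item: bytes) -> bool:
        fp, i1, i2 = self._candidates(item)
        for i in (i1, i2):
            if fp in self.buckets[i]:
                self.buckets[i][self.buckets[i].index(fp)] = EMPTY
                return True
        return False
```

A filter built this way never reports a false negative for an item whose insertion succeeded, because every move keeps a fingerprint in one of its two candidate buckets. After a deletion, `contains` can still return `True` if another stored item shares the same fingerprint and bucket pair, which is the usual false-positive behavior of AMQ structures.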