Search papers, labs, and topics across Lattice.
This paper introduces GPIR, a GPU-accelerated private information retrieval (PIR) system that optimizes performance by addressing bottlenecks introduced by multi-client batching. GPIR employs a stage-aware hybrid execution model, dynamically switching between operation-level and stage-level kernels to maximize on-chip data reuse, and introduces a transposed-layout GEMM design with fine-grained pipelining for the RowSel phase. The system achieves up to 305.7x higher throughput than the previous state-of-the-art, PIRonGPU, and scales to multi-GPU systems with minimal communication overhead.
GPIR shatters the PIR performance barrier, achieving 300x speedups on GPUs by rethinking kernel design and data layout to overcome memory bottlenecks exposed by multi-client batching.
Private information retrieval (PIR) allows private database queries but is hindered by intense server-side computation and memory traffic. Modern lattice-based PIR protocols typically involve three phases: ExpandQuery (expanding a query into encrypted indices), RowSel (encrypted row selection), and ColTor (recursive"column tournament"for final selection). ExpandQuery and ColTor primarily perform number-theoretic transforms (NTTs), whereas RowSel reduces to large-scale independent matrix-matrix multiplications (GEMMs). GPUs are theoretically ideal for these tasks, provided multi-client batching is used to achieve high throughput. However, batching fundamentally reshapes performance bottlenecks; while it amortizes database access costs, it expands working sets beyond the L2 cache capacity, causing divergent memory behaviors and excessive DRAM traffic. We present GPIR, a GPU-accelerated PIR system that rethinks kernel design, data layout, and execution scheduling. We introduce a stage-aware hybrid execution model that dynamically switches between operation-level kernels, which execute each primitive operation separately, and stage-level kernels, which fuse all operations within a protocol stage into a single kernel to maximize on-chip data reuse. For RowSel, we identify a performance gap caused by a structural mismatch between NTT-driven data layouts and tiled GEMM access patterns, which is exacerbated by multi-client batching. We resolve this through a transposed-layout GEMM design and fine-grained pipelining. Finally, we extend GPIR to multi-GPU systems, scaling both query throughput and database capacity with negligible communication overhead. GPIR achieves up to 305.7x higher throughput than PIRonGPU, the state-of-the-art GPU implementation.