Apr 23, 2026 · arXiv:2604.21645

Large-Scale Data Parallelization of Product Quantization and Inverted Indexing Using Dask

Ashley Abraham, Andrew Strelzoff, Haley R. Dozier, Althea C. Henslee, Mark Chappell

AI Summary

This paper introduces a distributed implementation of Product Quantization (PQ) and Inverted Indexing for approximate nearest neighbor search using Dask to handle large-scale datasets. The approach partitions data for parallel processing, enabling efficient clustering and indexing without sacrificing accuracy. Experiments demonstrate that this distributed method reduces computational requirements to levels comparable to medium-scale data processing.

Key Contribution

Scale up your nearest neighbor search without blowing your budget: this work shows how to use Dask to parallelize Product Quantization and Inverted Indexing, achieving accuracy comparable to single-machine methods on much larger datasets.

Abstract

Large-scale Nearest Neighbor (NN) search, though widely used in similarity search, remains limited by the computational cost of processing large-scale data. To reduce this cost, Approximate Nearest Neighbor (ANN) search is often used in applications that do not require exact similarity search and can instead rely on an approximation. Product Quantization (PQ) is a memory-efficient ANN method that is effective for clustering datasets of all sizes. Clustering large-scale, high-dimensional data is computationally expensive in both memory and execution time. This work focuses on a unique way to divide and conquer large-scale data in Python using PQ, Inverted Indexing, and Dask, combining the partial results without compromising accuracy and reducing computational requirements to the level needed for medium-scale data.
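
To make the terms in the abstract concrete, here is a minimal, standard-library-only sketch of product quantization combined with an inverted index, applied partition by partition. It is not the authors' implementation: the function names, the toy data, the choice to train codebooks on a sample, and the plain loop standing in for Dask partitions are all assumptions made for illustration. It also encodes raw vectors rather than residuals from the coarse centroid, which is a simplification of common IVF-PQ practice.

```python
import random

def sq_dist(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centroids):
    # Index of the centroid closest to `point`.
    return min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))

def kmeans(points, k, iters=10, seed=0):
    # Plain Lloyd's k-means; an empty cluster keeps its previous centroid.
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        for i, g in enumerate(groups):
            if g:
                centroids[i] = [sum(col) / len(g) for col in zip(*g)]
    return centroids

def split(vec, m):
    # Cut a vector into m equal sub-vectors (assumes len(vec) % m == 0).
    step = len(vec) // m
    return [vec[j * step:(j + 1) * step] for j in range(m)]

def train_pq(vectors, m, k):
    # One k-means codebook per subspace.
    return [kmeans([split(v, m)[j] for v in vectors], k, seed=j) for j in range(m)]

def encode(vec, codebooks):
    # PQ code: the nearest-centroid index in each subspace.
    subs = split(vec, len(codebooks))
    return tuple(nearest(s, cb) for s, cb in zip(subs, codebooks))

def build_inverted_index(partitions, coarse, codebooks):
    # Map each coarse cell to a list of (vector id, PQ code) entries.
    # Each partition is handled independently; a distributed framework
    # could process them in parallel and merge the per-cell lists.
    index = {c: [] for c in range(len(coarse))}
    next_id = 0
    for part in partitions:
        for vec in part:
            index[nearest(vec, coarse)].append((next_id, encode(vec, codebooks)))
            next_id += 1
    return index

# Toy data: 200 random 8-dimensional vectors in 4 partitions of 50.
rng = random.Random(42)
data = [[rng.random() for _ in range(8)] for _ in range(200)]
partitions = [data[i:i + 50] for i in range(0, 200, 50)]

sample = data[:100]                       # hypothetical training sample
coarse = kmeans(sample, k=4)              # coarse quantizer for the inverted index
codebooks = train_pq(sample, m=4, k=8)    # 4 subspaces, 8 centroids each
index = build_inverted_index(partitions, coarse, codebooks)
```

Each stored entry replaces an 8-float vector with four small integers, which is where the memory saving comes from; at query time only the inverted lists of the closest coarse cells would need to be scanned.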

Distributed Systems & Hardware · Inference & Quantization · Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 22

Year: 2026

Venue: N/A
