This paper introduces a novel token pruning framework for late-interaction retrieval models like ColBERT, grounded in hyperspace geometry and Voronoi cell estimation. The method interprets each token's influence based on its Voronoi region in the embedding space, enabling principled pruning that balances retrieval quality and index size reduction. Experiments demonstrate the approach's effectiveness as a pruning strategy and its utility for improving and interpreting token-level behavior in dense retrieval.
Token pruning in dense retrieval gets a geometric upgrade: Voronoi cells offer a principled way to shrink your index without sacrificing search quality.
Late-interaction models like ColBERT offer competitive performance across various retrieval tasks, but they require storing a dense embedding for each document token, which leads to substantial index storage overhead. Prior work addresses this by pruning low-importance token embeddings based on statistical and empirical measures, but these methods often either lack formal grounding or are ineffective. To address these shortcomings, we introduce a framework grounded in hyperspace geometry and cast token pruning as a Voronoi cell estimation problem in the embedding space. By interpreting each token's influence as a measure of its Voronoi region, our approach enables principled pruning that retains retrieval quality while reducing index size. Through our experiments, we demonstrate that this approach serves not only as a competitive pruning strategy but also as a valuable tool for improving and interpreting token-level behavior within dense retrieval systems.
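To make the idea of "influence as the measure of a Voronoi region" concrete, here is a minimal sketch under stated assumptions; it is not the paper's estimator. It assumes a token's region is the set of unit-norm query vectors for which that token has the highest dot product among the document's tokens (the token that a MaxSim-style score would select), and it estimates each region's share by Monte Carlo sampling of uniformly random directions. The function names `estimate_cell_shares` and `prune_tokens`, the uniform query distribution, and the sample count are all hypothetical choices made for illustration.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction uniformly from the unit sphere via normalized Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def estimate_cell_shares(token_embeddings, num_samples=2000, seed=0):
    """Estimate the fraction of random query directions each token 'wins'.

    Assumption: a token wins a query direction when it has the largest dot
    product with it among all tokens of the same document.
    """
    rng = random.Random(seed)
    dim = len(token_embeddings[0])
    wins = [0] * len(token_embeddings)
    for _ in range(num_samples):
        q = random_unit_vector(dim, rng)
        scores = [sum(a * b for a, b in zip(q, e)) for e in token_embeddings]
        best = max(range(len(scores)), key=scores.__getitem__)
        wins[best] += 1
    return [w / num_samples for w in wins]


def prune_tokens(token_embeddings, keep, **kwargs):
    """Return the indices (in original order) of the `keep` tokens with the largest shares."""
    shares = estimate_cell_shares(token_embeddings, **kwargs)
    ranked = sorted(range(len(shares)), key=lambda i: shares[i], reverse=True)
    return sorted(ranked[:keep])


# Toy document: four axis-aligned unit tokens plus one short token that is
# always out-scored by a neighbor, so its estimated region is empty.
doc = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0], [0.1, 0.0]]
shares = estimate_cell_shares(doc)
kept = prune_tokens(doc, keep=4)
```

In this toy setup the fifth token never attains the maximum score, so it is the one pruned. A real system would more plausibly sample from an empirical query-token distribution than from uniform directions, which is another assumption of this sketch.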