This paper investigates the suitability of self-supervised vision representations for content-based image retrieval (CBIR) using vector databases and approximate nearest neighbor (ANN) search. It finds that the latent space geometry of these representations, particularly anisotropy and skewness, significantly impacts ANN indexing performance. The study demonstrates that representations with higher isotropy and local purity lead to better semantic retrieval performance by aligning better with the distance-based assumptions of ANN indexes.
Self-supervised vision models that ace linear probing can still flop at semantic image retrieval because of skewed latent space geometry that breaks approximate nearest neighbor search.
Content-based image retrieval (CBIR) systems enable users to search images based on visual content instead of relying on metadata. The text domain has benefited from vector search of representations created with unsupervised methods such as BERT. However, modern self-supervised learning (SSL) methods for vision are rarely reported in the CBIR literature, which instead relies on supervised models or multi-modal methods that align text and vision. We evaluate how the representations learned by modern SSL methods for vision perform under typical retrieval stacks that leverage vector databases and nearest neighbor search. Our evaluation reveals that latent space geometry impacts approximate nearest neighbor (ANN) indexing. Specifically, the highly anisotropic, highly skewed representations produced by several modern SSL methods degrade the performance of partition-based and hashing-based search, even when their linear probe or k-NN accuracy is unaffected. In contrast, representations with higher isotropy and local purity better satisfy the distance-based assumptions of ANN indexes, leading to improved semantic retrieval performance.
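The geometric properties named in the abstract can be made concrete with a small sketch. The abstract does not give the paper's metric definitions, so the ones below are assumptions chosen for illustration: anisotropy is approximated by the mean pairwise cosine similarity, skewness by the skewness of the k-occurrence distribution (a common hubness measure), and local purity by the fraction of a point's k nearest neighbors that share its label. All function names are hypothetical and the data is synthetic.

```python
import numpy as np


def mean_cosine_similarity(x):
    """Average cosine similarity over distinct pairs; values near 0 suggest isotropy."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = unit @ unit.T
    n = len(x)
    # Exclude the diagonal (self-similarity is always 1).
    return (sims.sum() - np.trace(sims)) / (n * (n - 1))


def knn_indices(x, k):
    """Exact k nearest neighbors by Euclidean distance, excluding each point itself."""
    sq = (x ** 2).sum(axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (x @ x.T)
    np.fill_diagonal(d2, np.inf)
    return np.argsort(d2, axis=1)[:, :k]


def k_occurrence_skewness(x, k=10):
    """Skewness of how often each point appears in other points' k-NN lists."""
    counts = np.bincount(knn_indices(x, k).ravel(), minlength=len(x)).astype(float)
    centered = counts - counts.mean()
    return (centered ** 3).mean() / (counts.std() ** 3)


def local_purity(x, labels, k=10):
    """Mean fraction of each point's k nearest neighbors that share its label."""
    nbrs = knn_indices(x, k)
    return (labels[nbrs] == labels[:, None]).mean()


# Synthetic stand-ins for embeddings (assumption: not data from the paper).
rng = np.random.default_rng(0)
isotropic = rng.standard_normal((500, 32))
anisotropic = isotropic + 5.0  # a shared offset pushes all vectors into a narrow cone

print("mean cosine (isotropic):  ", mean_cosine_similarity(isotropic))
print("mean cosine (anisotropic):", mean_cosine_similarity(anisotropic))
print("k-occurrence skewness:    ", k_occurrence_skewness(isotropic))
```

Exact k-NN is used here only to keep the sketch self-contained. In a real retrieval stack the same statistics would be computed on embeddings from the model under study, and the neighbor lists would come from the ANN index being evaluated.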