This paper analyzes the failure modes of single-vector embeddings on the LIMIT retrieval benchmark, showing that dimensionality alone does not explain their poor performance. Instead, domain shift, misalignment between embedding similarity and relevance, and the "drowning" effect contribute significantly to the observed limitations. Finetuning mitigates some issues, but single-vector models still underperform multi-vector models and suffer from catastrophic forgetting when finetuned on LIMIT-like datasets.
Single-vector embeddings' retrieval failures aren't just about dimensionality; the models are also hobbled by domain shift, relevance misalignment, and a "drowning" effect that multi-vector models handle far better.
Recent work (Weller et al., 2025) introduced a naturalistic dataset called LIMIT and showed empirically that a wide range of popular single-vector embedding models suffer substantial drops in retrieval quality on it, raising concerns about the reliability of single-vector embeddings for retrieval. Although Weller et al. (2025) proposed limited dimensionality as the main contributing factor, we show that dimensionality alone cannot explain the observed failures. Results in Alon et al. (2016) imply that $(2k+1)$-dimensional vector embeddings suffice for top-$k$ retrieval, which points to other drivers of poor performance. Controlling for tokenization artifacts and linguistic similarity between attributes yields only modest gains. In contrast, we find that domain shift and misalignment between embedding similarities and the task's underlying notion of relevance are major contributors; finetuning mitigates these effects and can improve recall substantially. Even with finetuning, however, single-vector models remain markedly weaker than multi-vector representations, pointing to fundamental limitations. Moreover, finetuning single-vector models on LIMIT-like datasets leads to catastrophic forgetting (performance on MSMARCO drops by more than 40%), whereas forgetting for multi-vector models is minimal. To better understand the performance gap between single-vector and multi-vector models, we study the drowning-in-documents paradox (Reimers & Gurevych, 2021; Jacob et al., 2025): as the corpus grows, relevant documents are increasingly "drowned out" because embedding similarities behave, in part, like noisy statistical proxies for relevance. Through experiments and calculations on toy mathematical models, we illustrate why single-vector models are more susceptible to drowning effects than multi-vector models.
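The drowning effect can be illustrated with a small simulation. The sketch below is not the paper's toy model; it is a minimal illustration under assumptions made here: each query has one relevant document whose similarity score is drawn from N(signal, 1), every irrelevant document scores N(0, 1), and scores are independent. The function name `recall_at_k` and the parameters `signal`, `trials`, and `seed` are hypothetical.

```python
import random


def recall_at_k(corpus_size, k=10, signal=2.0, trials=300, seed=0):
    """Estimate how often the single relevant document ranks in the top-k.

    Toy assumption (not from the paper): the relevant document scores
    N(signal, 1) and each irrelevant document scores N(0, 1), independently.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        relevant_score = rng.gauss(signal, 1.0)
        # Count irrelevant documents that outscore the relevant one.
        outscored_by = 0
        for _ in range(corpus_size - 1):
            if rng.gauss(0.0, 1.0) > relevant_score:
                outscored_by += 1
                if outscored_by >= k:
                    break  # relevant document has already fallen out of the top-k
        if outscored_by < k:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # Same score distributions, growing corpus: only the number of
    # irrelevant "distractors" changes between runs.
    for n in (100, 1000, 10000):
        print(n, recall_at_k(n))
```

Under these assumptions the per-document noise stays fixed, yet recall@k falls as the corpus grows, because a larger pool of irrelevant documents makes it more likely that at least k of them draw an unusually high score.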