This paper systematically analyzes the robustness of LLM-based dense retrievers, evaluating their generalizability across 30 datasets and their stability against query variations and adversarial attacks. The study reveals a "specialization tax": models optimized for complex reasoning exhibit poor generalizability, and while LLM-based retrievers are more resilient to typos and corpus poisoning than encoder-only baselines, they remain vulnerable to semantic perturbations. Embedding geometry is found to be predictive of lexical stability, and scaling model size generally improves robustness.
LLM-based retrievers optimized for complex reasoning often stumble in broader contexts, suffering a "specialization tax" that limits their generalizability.
Decoder-only large language models (LLMs) are increasingly replacing BERT-style architectures as the backbone for dense retrieval, achieving substantial performance gains and broad adoption. However, the robustness of these LLM-based retrievers remains underexplored. In this paper, we present the first systematic study of the robustness of state-of-the-art open-source LLM-based dense retrievers from two complementary perspectives: generalizability and stability. For generalizability, we evaluate retrieval effectiveness across four benchmarks spanning 30 datasets, using linear mixed-effects models to estimate marginal mean performance and disentangle intrinsic model capability from dataset heterogeneity. Our analysis reveals that while instruction-tuned models generally excel, those optimized for complex reasoning often suffer a "specialization tax," exhibiting limited generalizability in broader contexts. For stability, we assess model resilience against both unintentional query variations (e.g., paraphrasing, typos) and malicious adversarial attacks (e.g., corpus poisoning). We find that LLM-based retrievers show improved robustness against typos and corpus poisoning compared to encoder-only baselines, yet remain vulnerable to semantic perturbations like synonymizing. Further analysis shows that embedding geometry (e.g., angular uniformity) provides predictive signals for lexical stability and suggests that scaling model size generally improves robustness. These findings inform future robustness-aware retriever design and principled benchmarking. Our code is publicly available at https://github.com/liyongkang123/Robust_LLM_Retriever_Eval.
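To make the generalizability analysis concrete, here is a minimal sketch (not the authors' released code; see the repository above for that) of the kind of linear mixed-effects model the abstract describes: per-dataset retrieval scores are modeled with a fixed effect for the retriever and a random intercept per dataset, so the retriever coefficients approximate marginal mean performance with dataset heterogeneity absorbed. The retriever names and scores below are hypothetical, synthetic placeholders.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
retrievers = ["retriever_a", "retriever_b", "retriever_c"]  # hypothetical models
datasets = [f"dataset_{i}" for i in range(30)]              # 30 datasets, as in the study

# Synthetic nDCG@10 scores: dataset difficulty + model effect + noise.
rows = []
for d in datasets:
    difficulty = rng.normal(0.45, 0.08)
    for m_idx, m in enumerate(retrievers):
        rows.append({
            "retriever": m,
            "dataset": d,
            "ndcg10": difficulty + 0.02 * m_idx + rng.normal(0, 0.02),
        })
scores = pd.DataFrame(rows)

# Random intercept per dataset absorbs dataset difficulty; the fixed effects
# for `retriever` then approximate each model's marginal mean performance.
fit = smf.mixedlm("ndcg10 ~ C(retriever)", scores, groups=scores["dataset"]).fit()
print(fit.summary())
```

In this setup, comparing retrievers via the fixed-effect estimates rather than raw dataset averages separates intrinsic model capability from how easy or hard each dataset happens to be.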