NAVER Labs · Feb 18, 2026 · arXiv:2602.16136

Retrieval Collapses When AI Pollutes the Web

Hongyeon Yu, Dongchan Kim, Young-Bum Kim

AI Summary

The paper introduces and analyzes "Retrieval Collapse," a failure mode in information retrieval where AI-generated content dominates search results, leading to reduced source diversity and potential infiltration of low-quality or adversarial content. Through controlled experiments with SEO-style and adversarial content, the authors demonstrate that even when answer accuracy appears stable, retrieval pipelines can become heavily reliant on synthetic sources. They find that while LLM-based rankers can suppress adversarial content better than BM25, SEO-style AI content can still lead to high levels of pool and exposure contamination.

Key Contribution

RAG systems can become dangerously reliant on AI-generated content even when accuracy seems stable, creating a "Retrieval Collapse" where synthetic evidence dominates.

Abstract

The rapid proliferation of AI-generated content on the Web presents a structural risk to information retrieval, as search engines and Retrieval-Augmented Generation (RAG) systems increasingly consume evidence produced by Large Language Models (LLMs). We characterize this ecosystem-level failure mode as Retrieval Collapse, a two-stage process in which (1) AI-generated content dominates search results, eroding source diversity, and (2) low-quality or adversarial content infiltrates the retrieval pipeline. We analyze this dynamic through controlled experiments involving both high-quality SEO-style content and adversarially crafted content. In the SEO scenario, 67% pool contamination led to over 80% exposure contamination, creating a homogenized yet deceptively healthy state in which answer accuracy remains stable despite the reliance on synthetic sources. Conversely, under adversarial contamination, baselines such as BM25 exposed ~19% of harmful content, whereas LLM-based rankers demonstrated stronger suppression capabilities. These findings highlight the risk of retrieval pipelines quietly shifting toward synthetic evidence and the need for retrieval-aware strategies to prevent a self-reinforcing cycle of quality decline in Web-grounded systems.
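The distinction between pool contamination and exposure contamination can be made concrete with a small calculation. The sketch below is illustrative only: the abstract does not give the paper's exact metric definitions, so it assumes pool contamination is the share of AI-generated documents in the candidate pool and exposure contamination is the share of top-k result slots filled by such documents. The function names, the cutoff `k`, and the toy document IDs are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Set


def pool_contamination(pool: List[str], synthetic: Set[str]) -> float:
    """Share of documents in the candidate pool that are AI-generated (assumed definition)."""
    if not pool:
        return 0.0
    return sum(doc in synthetic for doc in pool) / len(pool)


def exposure_contamination(
    rankings: Dict[str, List[str]], synthetic: Set[str], k: int = 10
) -> float:
    """Share of top-k result slots, over all queries, filled by AI-generated documents (assumed definition)."""
    shown = [doc for ranked in rankings.values() for doc in ranked[:k]]
    if not shown:
        return 0.0
    return sum(doc in synthetic for doc in shown) / len(shown)


# Toy data, invented for illustration: 2 human-written and 4 synthetic documents.
pool = ["h1", "h2", "s1", "s2", "s3", "s4"]
synthetic = {"s1", "s2", "s3", "s4"}

# Hypothetical rankings in which the ranker tends to favor synthetic documents.
rankings = {
    "q1": ["s1", "s2", "h1", "s3", "h2", "s4"],
    "q2": ["s3", "s4", "s1", "h2", "h1", "s2"],
}

pool_rate = pool_contamination(pool, synthetic)
exposure_rate = exposure_contamination(rankings, synthetic, k=3)
print(f"pool contamination: {pool_rate:.2f}")
print(f"exposure contamination at k=3: {exposure_rate:.2f}")
```

In this toy setup, exposure exceeds the pool rate because synthetic documents occupy a disproportionate share of the top ranks. This is the same direction of amplification the abstract reports for the SEO scenario, though the numbers here are invented.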

Data Curation & Synthetic Data · Recommendation & Information Retrieval · Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 18

Year: 2026

Venue: N/A
