Radboud | May 5, 2026 | arXiv:2605.03824

Reproducing Complex Set-Compositional Information Retrieval

Vincent Degenhart, Dewi Timman, Arjen P. de Vries, Faegheh Hasibi, Mohanna Hoveyda

AI Summary

This paper benchmarks neural and reasoning-targeted retrieval methods on set-compositional information retrieval tasks, including a newly introduced LIMIT+ dataset designed to minimize reliance on pretrained knowledge. Results show that while neural retrievers excel on QUEST, their performance plummets on LIMIT+, where lexical methods dominate, indicating a reliance on semantic shortcuts rather than genuine constraint satisfaction. Further analysis reveals performance degradation across all methods with increasing compositional depth, with dense methods exhibiting the most significant collapse.

Key Contribution

Neural retrievers that lead on QUEST collapse on LIMIT+ (Recall@100 drops from about 0.42 to below 0.02), where relevance depends on satisfying set-theoretic constraints and less on pretrained knowledge. This suggests a reliance on semantic shortcuts rather than genuine constraint satisfaction.

Abstract

Complex information needs may involve set-compositional queries using conjunction, disjunction, and exclusion, yet it remains unclear whether current retrieval paradigms genuinely satisfy such constraints or exploit "semantic shortcuts". We conduct a reproducibility study to benchmark major retrieval families and reasoning-targeted methods on QUEST and QUEST+Variants, and introduce LIMIT+, a controlled benchmark where relevance depends on arbitrary attribute predicates and constraint satisfaction, and less on pretrained knowledge. Our findings show that (i) on QUEST, the best neural retrievers are more than twice as effective as BM25 (Recall@100 > 0.41 vs. 0.20), but reasoning-targeted methods such as ReasonIR and Search-R1 do not uniformly outperform general-purpose retrievers; (ii) on LIMIT+, these gains fail to transfer: the strongest QUEST method collapses from Recall@100 ≈ 0.42 to below 0.02, while classic lexical retrieval rises to ~0.96; and (iii) stratifying by compositional depth reveals a consistent degradation across all methods, with algebraic sparse and lexical methods remaining more stable while dense approaches collapse. We release code and LIMIT+ data generation scripts to support future reproducibility and controlled evaluation.
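To make the evaluation setting concrete, the following is a minimal sketch of how a set-compositional query (conjunction, disjunction, exclusion) defines a relevant set and how Recall@k is computed against it. It assumes each document is represented as a set of attribute labels; the toy data and the names `satisfying_docs` and `recall_at_k` are hypothetical illustrations and are not taken from the paper's released code or from LIMIT+.

```python
from typing import Dict, Iterable, List, Set


def satisfying_docs(
    docs: Dict[str, Set[str]],
    all_of: Iterable[str] = (),
    any_of: Iterable[str] = (),
    none_of: Iterable[str] = (),
) -> Set[str]:
    """Return ids of documents whose attributes satisfy the query.

    all_of  -> conjunction: every listed attribute must be present
    any_of  -> disjunction: at least one must be present (ignored if empty)
    none_of -> exclusion: none of the listed attributes may be present
    """
    required, optional, excluded = set(all_of), set(any_of), set(none_of)
    matches = set()
    for doc_id, attrs in docs.items():
        if not required <= attrs:
            continue
        if optional and not (optional & attrs):
            continue
        if excluded & attrs:
            continue
        matches.add(doc_id)
    return matches


def recall_at_k(ranked_ids: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top-k of a ranking."""
    if not relevant:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant) / len(relevant)


# Hypothetical toy corpus: document id -> attribute set (assumption for illustration).
docs = {
    "d1": {"red", "round"},
    "d2": {"red", "square"},
    "d3": {"blue", "round"},
    "d4": {"red", "round", "large"},
}

# Query: red AND round AND NOT large
relevant = satisfying_docs(docs, all_of=["red", "round"], none_of=["large"])

# A ranking that places a topically similar but excluded document first.
ranking = ["d4", "d1", "d2", "d3"]
print(recall_at_k(ranking, relevant, k=1))  # 0.0
print(recall_at_k(ranking, relevant, k=2))  # 1.0
```

In this toy ranking, `d4` shares most attributes with the query but violates the exclusion, which is the kind of case where topical similarity alone does not guarantee constraint satisfaction.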

Eval Frameworks & Benchmarks | Reasoning & Chain-of-Thought | Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 32

Year: 2026

Venue: N/A
