This paper investigates the impact of temporal corpus drift on the FreshStack retrieval benchmark, which focuses on technical domains. By comparing two snapshots of the corpus from October 2024 and October 2025, the authors analyze how the set of documents relevant to each query changes due to factors like API deprecations and code reorganizations. The key finding is that while relevant documents may migrate between repositories (e.g., from LangChain to LlamaIndex), the overall ranking of retrieval models remains relatively stable, suggesting that re-judged benchmarks can remain reliable.
Despite codebases evolving rapidly, retrieval benchmarks can remain surprisingly robust even when re-judged on newer versions of the corpus.
Information retrieval (IR) benchmarks typically follow the Cranfield paradigm, relying on static and predefined corpora. However, temporal changes in technical corpora, such as API deprecations and code reorganizations, can render existing benchmarks stale. In our work, we investigate how temporal corpus drift affects FreshStack, a retrieval benchmark focused on technical domains. We examine two independent corpus snapshots of FreshStack from October 2024 and October 2025 to answer questions about LangChain. Our analysis shows that all but one query posed in 2024 remain fully supported by the 2025 corpus, as relevant documents "migrate" from LangChain to competitor repositories, such as LlamaIndex. Next, we compare the accuracy of retrieval models on both snapshots and observe only minor shifts in model rankings, with a strong overall correlation of up to 0.978 Kendall $\tau$ at Recall@50. These results suggest that retrieval benchmarks re-judged with evolving temporal corpora can remain reliable for retrieval evaluation. We publicly release all our artifacts at https://github.com/fresh-stack/driftbench.
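To make the ranking comparison concrete, the following is a minimal sketch of how Recall@50 and a Kendall $\tau$ correlation between two snapshots could be computed. It is not the authors' evaluation code: the function names, the model names, and all scores are hypothetical placeholders, and the correlation shown is the simple tau-a variant, which may differ from the variant used in the paper.

```python
from itertools import combinations


def recall_at_k(ranked_doc_ids, relevant_doc_ids, k=50):
    """Fraction of the relevant documents that appear in the top-k results."""
    relevant = set(relevant_doc_ids)
    if not relevant:
        return 0.0
    top_k = set(ranked_doc_ids[:k])
    return len(top_k & relevant) / len(relevant)


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two score dicts keyed by the same model names."""
    models = sorted(scores_a)
    if len(models) < 2:
        raise ValueError("need at least two models to compare rankings")
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        # A pair is concordant when both snapshots order the two models the same way.
        sign = (scores_a[m1] - scores_a[m2]) * (scores_b[m1] - scores_b[m2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical Recall@50 scores per model on each snapshot (illustrative only).
recall_2024 = {"model_a": 0.62, "model_b": 0.55, "model_c": 0.48, "model_d": 0.40}
recall_2025 = {"model_a": 0.60, "model_b": 0.51, "model_c": 0.53, "model_d": 0.38}

tau = kendall_tau(recall_2024, recall_2025)
print(f"Kendall tau between snapshots: {tau:.3f}")
```

In this toy data only one pair of models (`model_b` and `model_c`) swaps order between snapshots, so five of the six pairs are concordant. A value close to 1, such as the 0.978 reported above, indicates that nearly all model pairs keep their relative order across snapshots.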