May 6, 2026 | arXiv:2605.04615

Beyond Retrieval: A Multitask Benchmark and Model for Code Search

Siqiao Xue, Zihan Liao, Jin Qin, Ziyin Zhang, Yixiang Mu, Fan Zhou, Hang Yu

AI Summary

The authors introduce CoREB, a multitask benchmark for code search that covers both retrieval and reranking and addresses limitations of existing benchmarks such as data contamination and degenerate binary relevance. They benchmark eleven embedding models and five rerankers across text-to-code, code-to-text, and code-to-code tasks, finding that code-specialized embeddings excel at code-to-code retrieval while short keyword queries collapse every model to near-zero nDCG@10. A fine-tuned reranker, CoREB-Reranker, achieves consistent gains across all three tasks, which supports evaluating the full code search pipeline rather than first-stage retrieval alone.

Key Contribution

Developer-style short keyword queries collapse every evaluated model to near-zero nDCG@10, erasing the advantage of even the best code embedding models and exposing a critical gap in current code search techniques.

Abstract

Code search has usually been evaluated as first-stage retrieval, even though production systems rely on broader pipelines with reranking and developer-style queries. Existing benchmarks also suffer from data contamination, label noise, and degenerate binary relevance. In this paper, we introduce CoREB, a contamination-limited, multitask code retrieval and reranking benchmark, together with a fine-tuned code reranker, that goes beyond retrieval to cover the full code search pipeline. CoREB is built from counterfactually rewritten LiveCodeBench problems in five programming languages and delivered as timed releases with graded relevance judgments. We benchmark eleven embedding models and five rerankers across three tasks: text-to-code, code-to-text, and code-to-code. Our experiments reveal that: (1) code-specialised embeddings dominate code-to-code retrieval (~2× over general encoders), yet no single model wins all three tasks; (2) short keyword queries, the format closest to real developer search, collapse every model to near-zero nDCG@10; (3) off-the-shelf rerankers are task-asymmetric, with a 12-point swing on code-to-code and no baseline net-positive across all tasks; (4) our fine-tuned CoREB-Reranker is the first to achieve consistent gains across all three tasks. The data and model are released.
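The abstract reports results as nDCG@10 over graded relevance judgments. As a reading aid, the following is a minimal sketch of how nDCG@k is commonly computed. It assumes linear gain and a log2 rank discount; the abstract does not state which gain formulation CoREB uses, and the function names and toy labels here are hypothetical, not taken from the paper's code.

```python
import math
from typing import Dict, List


def dcg_at_k(gains: List[float], k: int) -> float:
    """Discounted cumulative gain over the first k positions.

    Assumption: linear gain with a log2(rank + 1) discount (rank is 1-based).
    """
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """nDCG@k for one query.

    ranked_ids: candidate ids in the order the system returned them.
    relevance:  graded judgments per id; ids missing from the map count as 0.
    """
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    # The ideal ranking sorts all judged items by descending grade.
    ideal_gains = sorted(relevance.values(), reverse=True)
    ideal = dcg_at_k(ideal_gains, k)
    if ideal == 0.0:
        return 0.0  # no relevant items judged for this query
    return dcg_at_k(gains, k) / ideal


if __name__ == "__main__":
    # Hypothetical graded labels: 2 = exact match, 1 = partial, 0 = irrelevant.
    labels = {"a": 2.0, "b": 1.0, "c": 0.0}
    print(ndcg_at_k(["a", "b", "c"], labels))  # best possible ordering
    print(ndcg_at_k(["c", "b", "a"], labels))  # reversed ordering scores lower
```

Graded labels matter here because a binary judgment would score the partial match and the exact match identically, which is the "degenerate binary relevance" problem the abstract attributes to earlier benchmarks.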

Code Generation & Program Synthesis | Eval Frameworks & Benchmarks | Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 51

Year: 2026

Venue: N/A
