Rutgers | Mar 17, 2026 | arXiv:2603.17205

OPERA: Online Data Pruning for Efficient Retrieval Model Adaptation

Haoyang Fang, Shuai Zhang, Yifei Ma, Hengyi Wang, Cuixiong Hu, Katrin Kirchhoff, Bernie Wang, George Karypis

AI Summary

The paper introduces OPERA, a data pruning framework for efficient domain adaptation of dense retrieval models. OPERA explores static pruning (SP) based on query-document similarity and proposes a two-stage dynamic pruning (DP) strategy that adaptively modulates sampling probabilities at query and document levels. Experiments across eight datasets show that DP improves both ranking (NDCG@10 +1.9%) and retrieval (Recall@20 +0.7%) while reducing training time by over 50%.

Key Contribution

OPERA's dynamic pruning adapts dense retrieval models to new domains with better ranking and recall than standard finetuning, reaching comparable performance in less than half the training time.

Abstract

Domain-specific finetuning is essential for dense retrievers, yet not all training pairs contribute equally to the learning process. We introduce OPERA, a data pruning framework that exploits this heterogeneity to improve both the effectiveness and efficiency of retrieval model adaptation. We first investigate static pruning (SP), which retains only high-similarity query-document pairs, revealing an intrinsic quality-coverage tradeoff: ranking (NDCG) improves while retrieval (Recall) can degrade due to reduced query diversity. To resolve this tradeoff, we propose a two-stage dynamic pruning (DP) strategy that adaptively modulates sampling probabilities at both query and document levels throughout training, prioritizing high-quality examples while maintaining access to the full training set. Evaluations across eight datasets spanning six domains demonstrate the effectiveness of both approaches: SP improves ranking over standard finetuning (NDCG@10 +0.5%), while DP achieves the strongest performance on both ranking (NDCG@10 +1.9%) and retrieval (Recall@20 +0.7%), with an average rank of 1.38 across all methods. These findings scale to Qwen3-Embedding, an LLM-based dense retriever, confirming architecture-agnostic benefits. Notably, DP reaches comparable performance in less than 50% of the training time required by standard finetuning.
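The contrast between the two strategies in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not say how similarity is scored or how sampling probabilities are updated. The sketch assumes precomputed similarity scores, a softmax weighting with a hypothetical `temperature` parameter, and a query-level weight equal to the best similarity among that query's documents. The function names `static_prune` and `two_stage_sample` are also illustrative.

```python
import math
import random

def static_prune(pairs, keep_ratio):
    """Keep the top `keep_ratio` fraction of (query, doc, similarity) pairs.

    Low-similarity pairs are dropped permanently, so queries whose
    documents all score low can vanish from the training set.
    """
    ranked = sorted(pairs, key=lambda p: p[2], reverse=True)
    n_keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:n_keep]

def two_stage_sample(pairs, rng, temperature=1.0):
    """Draw one (query, doc) pair in two stages without discarding any data.

    Stage 1 picks a query, stage 2 picks one of its documents. Both stages
    use softmax weights, which are strictly positive, so every pair keeps
    a nonzero chance of being drawn.
    """
    by_query = {}
    for query, doc, sim in pairs:
        by_query.setdefault(query, []).append((doc, sim))

    # Stage 1 (assumption): weight each query by its best document score.
    queries = list(by_query)
    q_weights = [
        math.exp(max(sim for _, sim in by_query[q]) / temperature)
        for q in queries
    ]
    query = rng.choices(queries, weights=q_weights, k=1)[0]

    # Stage 2 (assumption): weight each document by its own score.
    docs = by_query[query]
    d_weights = [math.exp(sim / temperature) for _, sim in docs]
    doc, _ = rng.choices(docs, weights=d_weights, k=1)[0]
    return query, doc

# Toy data: (query id, document id, similarity score), all hypothetical.
pairs = [
    ("q1", "d1", 0.9),
    ("q1", "d2", 0.2),
    ("q2", "d3", 0.5),
    ("q3", "d4", 0.1),
]

kept = static_prune(pairs, keep_ratio=0.5)    # q3 is lost entirely
rng = random.Random(0)
batch = [two_stage_sample(pairs, rng) for _ in range(8)]
```

In this toy setup, static pruning removes query `q3` outright, which mirrors the coverage loss the abstract attributes to SP. The sampling variant only makes `q3` less likely. To be adaptive in the sense the abstract describes, the scores would presumably be refreshed as the model changes during training. That update rule is not given here, so the sketch leaves it out.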

Inference & Quantization | Recommendation & Information Retrieval | Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 50

Year: 2026

Venue: N/A
