This paper reproduces and compares knowledge distillation strategies for transformer-based cross-encoders in information retrieval, including distillation from LLM re-rankers and from an ensemble of cross-encoder teachers. The authors fine-tune a range of cross-encoders (BERT, RoBERTa, ELECTRA, DeBERTa-v3, ModernBERT) with different supervised objectives and evaluate them on in-domain and out-of-domain datasets. The results show that the pairwise MarginMSE and listwise InfoNCE objectives consistently outperform pointwise baselines, and that the choice of objective can matter as much as scaling the backbone architecture.
Forget bigger models: smarter training objectives like pairwise MarginMSE and listwise InfoNCE can boost cross-encoder performance as much as scaling the backbone architecture.
Recent advances have established transformer-based cross-encoders as a keystone of Information Retrieval (IR). Recent studies have focused on knowledge distillation and shown that, with the right strategy, traditional cross-encoders can reach the effectiveness of LLM re-rankers. Yet how these strategies compare with earlier training approaches, including distillation from strong cross-encoder teachers, remains unclear. In addition, few studies cover a wide range of backbone encoders, even though backbones have improved substantially since BERT. This lack of comprehensive studies in controlled environments makes it difficult to identify robust design choices. In this work, we reproduce the LLM-based distillation strategy of Schlatt et al. (2025) and compare it to the approach of Hofstätter et al. (2020), which is based on an ensemble of cross-encoder teachers, as well as to other supervised objectives. We use these strategies to fine-tune a large range of cross-encoders, from the original BERT and its follow-ups RoBERTa, ELECTRA, and DeBERTa-v3 to the more recent ModernBERT. We evaluate all models on both in-domain datasets (TREC-DL and MS MARCO dev) and out-of-domain datasets (BEIR, LoTTE, and Robust04). Our results show that objectives emphasizing relative comparisons (pairwise MarginMSE and listwise InfoNCE) consistently outperform pointwise baselines across all backbones and evaluation settings, and that objective choice can yield gains comparable to scaling the backbone architecture.
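To make the distinction between pointwise, pairwise, and listwise objectives concrete, here is a minimal sketch of the three loss families in plain Python. It uses the commonly cited textbook forms of binary cross-entropy, MarginMSE, and InfoNCE. The function names, the absence of a temperature parameter, and the way candidates are grouped are assumptions made for illustration; the abstract does not specify the paper's exact formulation.

```python
import math
from typing import Sequence


def pointwise_bce(score: float, label: int) -> float:
    """Pointwise baseline: binary cross-entropy on one (query, passage) logit."""
    def softplus(x: float) -> float:
        # Numerically stable log(1 + exp(x)).
        return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

    # -log(sigmoid(s)) for a relevant passage, -log(1 - sigmoid(s)) otherwise.
    return softplus(-score) if label == 1 else softplus(score)


def margin_mse(
    student_pos: Sequence[float],
    student_neg: Sequence[float],
    teacher_pos: Sequence[float],
    teacher_neg: Sequence[float],
) -> float:
    """Pairwise MarginMSE: the student's (positive - negative) score margin
    is regressed onto the teacher's margin for the same passage pair."""
    errors = [
        ((sp - sn) - (tp - tn)) ** 2
        for sp, sn, tp, tn in zip(student_pos, student_neg, teacher_pos, teacher_neg)
    ]
    return sum(errors) / len(errors)


def info_nce(scores: Sequence[float], positive_index: int = 0) -> float:
    """Listwise InfoNCE: softmax cross-entropy over one query's candidate
    list, with the relevant passage as the target class."""
    m = max(scores)  # subtract the max for a stable log-sum-exp
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum_exp - scores[positive_index]


if __name__ == "__main__":
    # Hypothetical scores for one query: one relevant passage, three negatives.
    student = [2.1, 0.4, -0.3, 1.0]
    print("pointwise:", pointwise_bce(student[0], 1))
    print("pairwise :", margin_mse([2.1], [0.4], [9.0], [3.5]))
    print("listwise :", info_nce(student, positive_index=0))
```

The pointwise loss looks at each passage in isolation, whereas MarginMSE and InfoNCE only depend on how a passage scores relative to the other candidates for the same query, which is the "relative comparison" property the abstract refers to.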