The paper introduces RRK, a listwise reranker that compresses each document into a fixed-size, multi-token embedding representation to improve efficiency. Trained via distillation, the 8B-parameter RRK runs 3x-18x faster than smaller rerankers (0.6-4B parameters) while matching or outperforming them in effectiveness, and the efficiency gains grow on long documents. The approach uses rich compressed representations to make listwise reranking with large language models efficient.
Forget slow reranking: this method compresses documents into embeddings, letting an 8B-parameter model run up to 18x faster than smaller rerankers while matching or beating their effectiveness.
Reranking, the process of refining the output from a first-stage retriever, is often considered computationally expensive, especially when using Large Language Models (LLMs). A common approach to mitigate this cost involves utilizing smaller LLMs or controlling input length. Inspired by recent advances in document compression for retrieval-augmented generation (RAG), we introduce RRK, an efficient and effective listwise reranker compressing documents into multi-token fixed-size embedding representations. Our simple training via distillation shows that this combination of rich compressed representations and listwise reranking yields a highly efficient and effective system. In particular, our 8B-parameter model runs 3x-18x faster than smaller rerankers (0.6-4B parameters) while matching or outperforming them in effectiveness. The efficiency gains are even more striking on long-document benchmarks, where RRK widens its advantage further.
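A back-of-the-envelope sketch can illustrate why fixed-size compression helps most on long documents. It counts only the input positions a listwise reranker processes per query, once with raw document text and once with compressed embeddings. The query length, candidate-list size, and number of embeddings per document are hypothetical values chosen for illustration, not settings from the paper. The sketch also ignores model size and the cost of producing the embeddings, so its ratios are not comparable to the reported 3x-18x speedups.

```python
# Illustrative input-length accounting for listwise reranking.
# All constants are hypothetical assumptions, not values from the paper.

QUERY_TOKENS = 32        # assumed query length in tokens
NUM_DOCS = 20            # assumed number of candidates reranked together
EMBEDDINGS_PER_DOC = 8   # assumed fixed number of embeddings per document


def listwise_input_length(num_docs, tokens_per_doc, query_tokens,
                          embeddings_per_doc=None):
    """Input positions the reranker processes for one query.

    With embeddings_per_doc=None, each document is fed as raw text
    (tokens_per_doc positions). Otherwise each document occupies a
    fixed number of positions regardless of its original length.
    """
    per_doc = tokens_per_doc if embeddings_per_doc is None else embeddings_per_doc
    return query_tokens + num_docs * per_doc


def input_reduction(tokens_per_doc):
    """Ratio of raw-text input length to compressed input length."""
    raw = listwise_input_length(NUM_DOCS, tokens_per_doc, QUERY_TOKENS)
    compressed = listwise_input_length(
        NUM_DOCS, tokens_per_doc, QUERY_TOKENS, EMBEDDINGS_PER_DOC
    )
    return raw / compressed


if __name__ == "__main__":
    for tokens_per_doc in (128, 512, 2048):
        ratio = input_reduction(tokens_per_doc)
        print(f"{tokens_per_doc:>5} tokens/doc -> {ratio:.1f}x fewer input positions")
```

Under these assumptions the compressed input stays constant as documents get longer, while the raw-text input grows linearly with document length, which matches the direction of the long-document trend described in the abstract.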