This paper introduces CompRank, a reranking framework that makes large language model (LLM) rerankers more efficient through token-level compression and a decoding-free scoring mechanism. By decoupling document representations from candidate order and query context, CompRank allows document-side states to be reused and reduces computation while staying close to full-token quality: it reaches an average NDCG@10 of 39.2 on seven BEIR datasets with only 10.2% of document tokens retained, compared with 39.7 under full-token attention. On TREC-COVID candidate lists of up to 500 documents, it is 4.9 to 9.5 times faster end to end than generation-based listwise reranking, which makes it a promising option for retrieval-augmented generation systems.
CompRank reranks up to 9.5 times faster than generation-based listwise reranking while staying within 0.5 points of full-token attention on average NDCG@10 (39.2 vs. 39.7).
Large language model (LLM) rerankers have become an important component of modern retrieval and retrieval-augmented generation pipelines, but their high computational cost limits their applicability to long candidate lists. In this paper, we propose \textbf{CompRank}, a token-efficient reranking framework that reduces redundant computation by aligning reranker design with the sparsity of ranking signals. CompRank decouples document representations from candidate order and query context, enabling reusable document-side states; applies segment-wise token compression to reduce query--document interaction cost; and introduces a CopyNet-style objective that directly aligns attention-based document scoring with training supervision. Experiments on seven BEIR datasets show that CompRank achieves strong reranking performance while retaining only 10.2\% of document tokens, reaching an average NDCG@10 of 39.2 compared with 39.7 under full-token attention. Further scaling experiments on TREC-COVID show that CompRank remains stable when evaluated on candidate lists of up to 500 documents after training on 30-document lists, while achieving $4.9\times$--$9.5\times$ end-to-end speedup over generation-based listwise reranking and approximately $1.3\times$ speedup over the full-token CompRank variant. These results suggest that token-level compression and decoding-free attention scoring provide an effective path toward scalable LLM-based reranking.
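To make the two central ideas more concrete, the following is a minimal sketch, not the authors' implementation. It assumes that segment-wise compression can be approximated by mean-pooling fixed-length runs of token vectors, and that decoding-free scoring can be approximated by the share of softmax attention a single query vector places on each document's compressed tokens. The function names `compress_segments` and `attention_scores`, the toy vectors, and the pooling choice are all hypothetical; the abstract does not specify how CompRank compresses tokens or aggregates attention.

```python
import math


def compress_segments(token_vecs, segment_len):
    """Mean-pool each run of `segment_len` token vectors into one vector.

    Assumption: mean pooling stands in for the paper's (unspecified)
    segment-wise compression. The result depends only on the document,
    so it can be cached and reused across queries and candidate orders.
    """
    compressed = []
    for start in range(0, len(token_vecs), segment_len):
        segment = token_vecs[start:start + segment_len]
        dim = len(segment[0])
        compressed.append(
            [sum(vec[d] for vec in segment) / len(segment) for d in range(dim)]
        )
    return compressed


def attention_scores(query_vec, doc_states):
    """Score documents by the softmax attention mass on their tokens.

    No text is generated: one softmax over all candidate tokens yields a
    score per document. Assumption: summing attention mass per document
    is an illustrative aggregation, not the paper's exact scoring rule.
    """
    scale = math.sqrt(len(query_vec))
    logits, owners = [], []
    for doc_id, vecs in doc_states.items():
        for vec in vecs:
            logits.append(sum(q * x for q, x in zip(query_vec, vec)) / scale)
            owners.append(doc_id)
    peak = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    scores = {doc_id: 0.0 for doc_id in doc_states}
    for weight, doc_id in zip(weights, owners):
        scores[doc_id] += weight / total
    return scores


# Toy candidates: four 2-dimensional token vectors per document.
docs = {
    "d1": [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]],
    "d2": [[0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]],
}
# Document-side states are built once, independent of any query.
cache = {doc_id: compress_segments(vecs, 2) for doc_id, vecs in docs.items()}
scores = attention_scores([2.0, 0.0], cache)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this sketch the cache holds half as many vectors per document as the input, and the query touches only those compressed vectors. Because attention mass is summed per document, candidates with more retained tokens can accumulate more mass, so a real system would need some form of normalization or a learned objective, such as the CopyNet-style objective the abstract mentions, to keep scores comparable.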