Apple ML · Feb 26, 2026 · arXiv:2602.23234

Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments

Evangelia Christakopoulou, Vivekkumar Patel, Hemanth Velaga, Sandip T. Gaikwad

AI Summary

The paper addresses the scarcity of expert-provided textual relevance labels in App Store search ranking by using LLMs to generate additional labels. The authors systematically evaluate LLM configurations and find that a specialized, fine-tuned model outperforms a much larger pre-trained model at producing high-quality textual relevance labels. Augmenting the production ranker with these LLM-generated labels improves offline NDCG for both behavioral and textual relevance and yields a statistically significant +0.24% increase in conversion rate in a worldwide A/B test on the App Store ranker, with the largest gains on tail queries.

Key Contribution

Fine-tuning a specialized LLM to generate textual relevance labels for search ranking not only beats larger pre-trained models, but also drives significant real-world gains in App Store conversion rates, especially for tail queries.

Abstract

Large-scale commercial search systems optimize for relevance to drive successful sessions that help users find what they are looking for. To maximize relevance, we leverage two complementary objectives: behavioral relevance (results users tend to click or download) and textual relevance (a result's semantic fit to the query). A persistent challenge is the scarcity of expert-provided textual relevance labels relative to abundant behavioral relevance labels. We first address this by systematically evaluating LLM configurations, finding that a specialized, fine-tuned model significantly outperforms a much larger pre-trained one in providing highly relevant labels. Using this optimal model as a force multiplier, we generate millions of textual relevance labels to overcome the data scarcity. We show that augmenting our production ranker with these textual relevance labels leads to a significant outward shift of the Pareto frontier: offline NDCG improves for behavioral relevance while simultaneously increasing for textual relevance. These offline gains were validated by a worldwide A/B test on the App Store ranker, which demonstrated a statistically significant +0.24% increase in conversion rate, with the most substantial performance gains occurring in tail queries, where the new textual relevance labels provide a robust signal in the absence of reliable behavioral relevance labels.
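As a rough illustration of the offline metric mentioned in the abstract, the sketch below computes NDCG for a single ranked list against two separate label sets, one behavioral and one textual. The abstract does not say which NDCG variant, cutoff, or label scale the authors use, so the exponential gain, the log2 discount, the app names, and the grade values here are all assumptions made for illustration only.

```python
import math


def dcg(gains):
    """Discounted cumulative gain with exponential gain and log2 discount (assumed variant)."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg(ranked_gains, k=None):
    """NDCG@k of a ranked list of graded labels; returns 0.0 when no item is relevant."""
    k = len(ranked_gains) if k is None else k
    ideal_dcg = dcg(sorted(ranked_gains, reverse=True)[:k])
    return dcg(ranked_gains[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical ranker output for one query, with made-up graded labels.
ranking = ["app_a", "app_b", "app_c"]
behavioral = {"app_a": 2, "app_b": 0, "app_c": 1}  # e.g. derived from clicks or downloads
textual = {"app_a": 1, "app_b": 3, "app_c": 0}     # e.g. expert or LLM-generated judgments

behavioral_ndcg = ndcg([behavioral[app] for app in ranking])
textual_ndcg = ndcg([textual[app] for app in ranking])
print(f"behavioral NDCG: {behavioral_ndcg:.3f}, textual NDCG: {textual_ndcg:.3f}")
```

Scoring the same ranking against both label sets gives one point in the two-objective space. An outward shift of the Pareto frontier, as described in the abstract, corresponds to a ranker that reaches a higher value on one of these metrics without lowering the other.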

Data Curation & Synthetic Data · Natural Language Processing · Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 33

Year: 2026

Venue: N/A
