The paper addresses the scarcity of expert-provided textual relevance labels in app store search ranking by using LLMs to generate additional labels. The authors systematically evaluate LLM configurations and find that a specialized, fine-tuned model outperforms a much larger pre-trained model at producing textual relevance labels. Augmenting the production ranker with these LLM-generated labels improved offline NDCG for both behavioral and textual relevance. A worldwide A/B test on the App Store ranker then showed a statistically significant +0.24% increase in conversion rate, with the largest gains on tail queries.
Fine-tuning a specialized LLM to generate textual relevance labels for search ranking not only beats larger pre-trained models, but also drives significant real-world gains in App Store conversion rates, especially for tail queries.
Large-scale commercial search systems optimize for relevance to drive successful sessions that help users find what they are looking for. To maximize relevance, we leverage two complementary objectives: behavioral relevance (results users tend to click or download) and textual relevance (a result's semantic fit to the query). A persistent challenge is the scarcity of expert-provided textual relevance labels relative to abundant behavioral relevance labels. We first address this by systematically evaluating LLM configurations, finding that a specialized, fine-tuned model significantly outperforms a much larger pre-trained one in providing highly relevant labels. Using this optimal model as a force multiplier, we generate millions of textual relevance labels to overcome the data scarcity. We show that augmenting our production ranker with these textual relevance labels leads to a significant outward shift of the Pareto frontier: offline NDCG improves for behavioral relevance while simultaneously increasing for textual relevance. These offline gains were validated by a worldwide A/B test on the App Store ranker, which demonstrated a statistically significant +0.24% increase in conversion rate, with the most substantial performance gains occurring in tail queries, where the new textual relevance labels provide a robust signal in the absence of reliable behavioral relevance labels.
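As a rough illustration of the offline metric mentioned above, the following Python sketch computes NDCG for one ranked result list against two separate label sets, one behavioral and one textual. It is a minimal sketch under stated assumptions, not the authors' evaluation code: the linear gain, the log2 discount, the function names `dcg` and `ndcg`, and the example grades are all hypothetical, since the abstract does not specify the exact NDCG formulation or label scale.

```python
import math


def dcg(gains, k):
    """Discounted cumulative gain over the top-k positions (linear gain, log2 discount)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg(gains_in_ranked_order, k):
    """NDCG@k: DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(gains_in_ranked_order, reverse=True), k)
    return dcg(gains_in_ranked_order, k) / ideal if ideal > 0 else 0.0


# Hypothetical grades for the same four ranked results under each objective.
# Position 0 is the top-ranked result; higher grade means more relevant.
behavioral_gains = [3, 2, 0, 1]  # assumed grades derived from clicks/downloads
textual_gains = [2, 3, 1, 0]     # assumed grades from textual relevance labels

ndcg_behavioral = ndcg(behavioral_gains, k=4)
ndcg_textual = ndcg(textual_gains, k=4)
```

Scoring the same ranking against both label sets yields one point in the two-objective plane. An outward shift of the Pareto frontier, as described in the abstract, corresponds to a ranker whose points improve on both axes at once.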