The paper introduces ReFeed, a novel framework for generating query rewriting datasets that are sensitive to the stylistic characteristics of target documents in retrieval systems. ReFeed identifies failed retrieval cases, uses LLMs to rewrite queries to match the style of relevant documents, and validates improvements through re-retrieval, creating a corpus of (original, rewritten) query pairs. Experiments demonstrate that training rewriter models on ReFeed-generated data improves retrieval performance by aligning query style with document style, enhancing the adaptability of RAG systems.
LLMs can be prompted to rewrite queries in the style of relevant documents, producing datasets for training rewriters that improve retrieval performance by aligning queries with domain-specific language.
Retrieval systems often fail when user queries differ stylistically or semantically from the language used in domain documents. Query rewriting has been proposed to bridge this gap, improving retrieval by reformulating user queries into semantically equivalent forms. However, most existing approaches overlook the stylistic characteristics of target documents (their domain-specific phrasing, tone, and structure), which are crucial for matching real-world data distributions. We introduce a retrieval feedback-driven dataset generation framework that automatically identifies failed retrieval cases, leverages large language models to rewrite queries in the style of relevant documents, and verifies improvement through re-retrieval. The resulting corpus of (original, rewritten) query pairs enables the training of rewriter models that are explicitly aware of document style and retrieval feedback. This work highlights a new direction in data-centric information retrieval, emphasizing how feedback loops and document-style alignment can enhance the reasoning and adaptability of RAG systems in real-world, domain-specific contexts.
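The three-step loop described above (find failed retrievals, rewrite in document style, keep only rewrites that succeed on re-retrieval) can be sketched roughly as follows. This is an illustrative toy, not the paper's implementation: the token-overlap retriever, the `STYLE_MAP` lookup, and the function names `retrieve`, `stub_rewrite`, and `build_rewrite_pairs` are all assumptions made for the example, and `stub_rewrite` merely stands in for an LLM call.

```python
from typing import Callable, Dict, List, Tuple


def tokenize(text: str) -> set:
    return set(text.lower().split())


def retrieve(query: str, corpus: Dict[str, str], k: int = 1) -> List[str]:
    """Toy retriever (assumption): rank documents by Jaccard token overlap."""
    q = tokenize(query)

    def score(doc_id: str) -> float:
        d = tokenize(corpus[doc_id])
        union = q | d
        return len(q & d) / len(union) if union else 0.0

    # Sort by descending score, breaking ties by document id for determinism.
    return sorted(corpus, key=lambda doc_id: (-score(doc_id), doc_id))[:k]


# Hypothetical lay-term -> document-style vocabulary, standing in for an LLM.
STYLE_MAP = {"heart attack": "myocardial infarction", "medicine": "therapy"}


def stub_rewrite(query: str, relevant_doc: str) -> str:
    """Placeholder rewriter: swap in terms that the relevant document uses."""
    rewritten = query.lower()
    for lay, technical in STYLE_MAP.items():
        if technical in relevant_doc.lower():
            rewritten = rewritten.replace(lay, technical)
    return rewritten


def build_rewrite_pairs(
    queries: List[Tuple[str, str]],          # (query, id of relevant document)
    corpus: Dict[str, str],
    rewrite: Callable[[str, str], str],
    k: int = 1,
) -> List[Tuple[str, str]]:
    pairs = []
    for query, gold_id in queries:
        if gold_id in retrieve(query, corpus, k):
            continue  # step 1: only failed retrievals are of interest
        candidate = rewrite(query, corpus[gold_id])   # step 2: style-aware rewrite
        if gold_id in retrieve(candidate, corpus, k):  # step 3: verify by re-retrieval
            pairs.append((query, candidate))
    return pairs


corpus = {
    "d1": "myocardial infarction treatment with thrombolytic therapy",
    "d2": "heart healthy diet and exercise tips",
}
queries = [("heart attack medicine", "d1"), ("heart healthy diet", "d2")]

pairs = build_rewrite_pairs(queries, corpus, stub_rewrite)
print(pairs)  # [('heart attack medicine', 'myocardial infarction therapy')]
```

In this toy run the first query initially retrieves the wrong document, is rewritten into the document's vocabulary, and passes re-retrieval, so it becomes a training pair; the second query already succeeds and is skipped. A real pipeline would presumably replace the retriever and the rewriter with production components while keeping the same filter-rewrite-verify structure.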