The paper introduces Intent-Driven Dynamic Chunking (IDC), a document segmentation approach that uses a Large Language Model to predict user queries and a dynamic programming algorithm to optimize chunk boundaries against those predicted intents. Traditional chunking methods ignore user intent, which degrades retrieval performance; IDC aims to address this limitation. In experiments on six QA datasets, IDC outperforms traditional chunking strategies on five, improving top-1 retrieval accuracy by 5-67%, and matches the best baseline on the sixth, while generating 40-60% fewer chunks.
Forget fixed-length chunks: this new method uses predicted user queries to dynamically segment documents, boosting top-1 retrieval accuracy by up to 67% while producing 40-60% fewer chunks.
Breaking long documents into smaller segments is a fundamental challenge in information retrieval. Whether for search engines, question-answering systems, or retrieval-augmented generation (RAG), effective segmentation determines how well systems can locate and return relevant information. However, traditional methods, such as fixed-length or coherence-based segmentation, ignore user intent, leading to chunks that split answers or contain irrelevant noise. We introduce Intent-Driven Dynamic Chunking (IDC), a novel approach that uses predicted user queries to guide document segmentation. IDC leverages a Large Language Model to generate likely user intents for a document and then employs a dynamic programming algorithm to find the globally optimal chunk boundaries. This represents a novel application of DP to intent-aware segmentation that avoids greedy pitfalls. We evaluated IDC on six diverse question-answering datasets, including news articles, Wikipedia, academic papers, and technical documentation. IDC outperformed traditional chunking strategies on five datasets, improving top-1 retrieval accuracy by 5% to 67%, and matched the best baseline on the sixth. Additionally, IDC produced 40-60% fewer chunks than baseline methods while achieving 93-100% answer coverage. These results demonstrate that aligning document structure with anticipated information needs significantly boosts retrieval performance, particularly for long and heterogeneous documents.
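The dynamic programming step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The scoring function (token overlap between a chunk and a predicted query, minus a fixed per-chunk penalty), the names `segment` and `chunk_score`, and the parameters `max_len` and `chunk_penalty` are all assumptions made for illustration. The predicted queries are hard-coded here, whereas IDC would obtain them from a Large Language Model.

```python
import re

def tokens(text):
    """Lowercase alphabetic tokens of a string, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))

def chunk_score(sentences, queries, i, j, chunk_penalty):
    """Hypothetical objective for the chunk sentences[i:j].

    Rewards the chunk by the best fraction of any single query's tokens it
    covers, then subtracts a fixed cost per chunk so that fewer, more
    complete chunks are preferred. Assumes every query has at least one token.
    """
    chunk_tokens = set().union(*(tokens(s) for s in sentences[i:j]))
    best_coverage = max(
        len(tokens(q) & chunk_tokens) / len(tokens(q)) for q in queries
    )
    return best_coverage - chunk_penalty

def segment(sentences, queries, max_len=8, chunk_penalty=0.5):
    """Return chunk boundaries [(start, end), ...] maximizing the total score.

    best[j] holds the best total score for the first j sentences and
    back[j] the start of the last chunk in that solution, so every
    boundary placement is considered rather than chosen greedily.
    """
    n = len(sentences)
    best = [float("-inf")] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):
            cand = best[i] + chunk_score(sentences, queries, i, j, chunk_penalty)
            if cand > best[j]:
                best[j], back[j] = cand, i
    # Walk the back-pointers from the end to recover the boundaries.
    bounds = []
    j = n
    while j > 0:
        bounds.append((back[j], j))
        j = back[j]
    return bounds[::-1]

sentences = [
    "The cat sat on the mat.",
    "The cat likes fish.",
    "Stocks fell sharply today.",
    "Investors sold stocks quickly.",
]
# Stand-ins for LLM-predicted intents (assumed, keyword-style for simplicity).
queries = ["cat fish mat", "stocks investors fell"]

chunks = segment(sentences, queries)
print(chunks)  # [(0, 2), (2, 4)]
```

In this toy example, the two sentences about the cat together cover the first predicted query fully, and the two about stocks cover the second, so the highest-scoring segmentation groups them into two chunks rather than four single-sentence chunks or one large chunk. A realistic objective would likely use embedding similarity rather than token overlap.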