The paper introduces Evo-Retriever, a multimodal document retrieval framework that uses an LLM to guide curriculum evolution based on a novel Viewpoint-Pathway collaboration. This collaboration involves multi-view image alignment for fine-grained matching and bidirectional contrastive learning to generate hard queries and complementary learning paths. By adaptively adjusting the training curriculum based on model-state summaries, Evo-Retriever achieves state-of-the-art performance on ViDoRe V2 and MMEB (VisDoc) datasets.
An LLM meta-controller can adjust the training curriculum of a multimodal retrieval model as the model's state evolves, which the authors report yields state-of-the-art nDCG@5 on ViDoRe V2 and MMEB (VisDoc).
Visual-language models (VLMs) excel at data mappings, but the heterogeneity and lack of structure of real-world documents disrupt the consistency of cross-modal embeddings. Recent late-interaction methods enhance image-text alignment through multi-vector representations, yet traditional training with limited samples and static strategies cannot adapt to the model's dynamic evolution, which causes cross-modal retrieval confusion. To overcome this, we introduce Evo-Retriever, a retrieval framework featuring LLM-guided curriculum evolution built upon a novel Viewpoint-Pathway collaboration. First, we employ multi-view image alignment to enhance fine-grained matching via multi-scale and multi-directional perspectives. Then, a bidirectional contrastive learning strategy generates "hard queries" and establishes complementary learning paths for visual and textual disambiguation to rebalance supervision. Finally, the model-state summary from this collaboration is fed into an LLM meta-controller, which adaptively adjusts the training curriculum using expert knowledge to promote the model's evolution. On ViDoRe V2 and MMEB (VisDoc), Evo-Retriever achieves state-of-the-art performance, with nDCG@5 scores of 65.2% and 77.1%, respectively.
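For readers unfamiliar with the two technical terms the abstract relies on, the following is a minimal sketch of late-interaction scoring over multi-vector representations and of the nDCG@5 metric. It is not Evo-Retriever's implementation: the scoring function is the generic MaxSim operator used by ColBERT-style late-interaction retrievers, the nDCG variant assumes linear gain and that every candidate page appears in the ranked list, and all names and toy vectors (`maxsim_score`, `ndcg_at_k`, `page_a`, and so on) are hypothetical.

```python
import math


def maxsim_score(query_vecs, doc_vecs):
    """Generic late-interaction (MaxSim) score, assumed here for illustration.

    Each query vector is matched to its most similar document vector
    (dot product), and the per-query-vector maxima are summed.
    """
    return sum(
        max(sum(q * d for q, d in zip(qv, dv)) for dv in doc_vecs)
        for qv in query_vecs
    )


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k items (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, k=5):
    """nDCG@k, assuming ranked_relevances covers every candidate document."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Toy data (hypothetical): a query with two token vectors and three pages,
# each page represented by one or more patch vectors.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "page_a": [[0.9, 0.1], [0.2, 0.8]],
    "page_b": [[0.5, 0.5]],
    "page_c": [[0.1, 0.2], [0.3, 0.1]],
}
relevance = {"page_a": 0, "page_b": 1, "page_c": 0}  # only page_b is relevant

# Rank pages by descending MaxSim score, then evaluate the ranking.
scores = {name: maxsim_score(query, vecs) for name, vecs in pages.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
ndcg5 = ndcg_at_k([relevance[name] for name in ranking], k=5)
print(ranking, round(ndcg5, 4))
```

In this toy case the only relevant page is ranked second, so nDCG@5 falls below 1. The reported 65.2% and 77.1% are averages of this kind of per-query score over each benchmark's queries, although the benchmarks' exact gain and cutoff conventions may differ from the assumptions above.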