This paper introduces multilingual multi-hop question answering (MM-hop) benchmarks by translating English-only datasets into five languages to evaluate RAG systems in multilingual settings. It then proposes DaPT, a dual-path RAG framework that generates sub-question graphs in both the source language and English, merges them, and then retrieves and answers sub-question by sub-question. Experiments show that DaPT outperforms existing RAG systems, achieving an 18.3% relative improvement in average EM score over the strongest baseline on the MuSiQue benchmark.
Multilingual question answering is harder than you think: even state-of-the-art RAG systems stumble when dealing with questions and knowledge in multiple languages.
Retrieval-augmented generation (RAG) systems have made significant progress on complex multi-hop question answering (QA) tasks in English. In practice, however, RAG systems must also retrieve across multilingual corpora and queries, which leaves several open challenges. The first is the absence of benchmarks that assess RAG systems' capabilities in the multilingual multi-hop (MM-hop) QA setting. The second is overreliance on LLMs' strong semantic understanding of English, which diminishes effectiveness in multilingual scenarios. To address these challenges, we first construct multilingual multi-hop QA benchmarks by translating English-only benchmarks into five languages, and then we propose DaPT, a novel multilingual RAG framework. DaPT generates sub-question graphs in parallel for both the source-language query and its English translation, then merges them before employing a bilingual retrieval-and-answer strategy to solve the sub-questions sequentially. Our experimental results demonstrate that advanced RAG systems suffer from a significant performance imbalance in multilingual scenarios. Furthermore, our proposed method consistently yields more accurate and concise answers than the baselines, significantly enhancing RAG performance on this task. For instance, on the most challenging MuSiQue benchmark, DaPT achieves a relative improvement of 18.3% in average EM score over the strongest baseline.
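The dual-path flow described in the abstract can be illustrated with a small Python sketch. This is not the authors' implementation: the abstract does not specify how the graphs are merged or how retrieval is performed, so the sketch assumes that the two graphs share node ids, that merging is a simple union keeping both language variants, and that `{q1}`-style placeholders mark dependencies. The names `SubQuestionGraph`, `merge_graphs`, `solve`, `toy_retrieve`, and `toy_answer` are all hypothetical, and the retriever and answerer are toy stand-ins for a real retriever and an LLM.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class SubQuestionGraph:
    """Hypothetical container: node id -> sub-question, node id -> dependencies."""
    questions: dict
    deps: dict = field(default_factory=dict)


def merge_graphs(src, en):
    """Assumed merge rule: union by node id, keeping both language variants."""
    merged = {}
    for node in sorted(set(src.questions) | set(en.questions)):
        merged[node] = {
            "src": src.questions.get(node),
            "en": en.questions.get(node),
            "deps": src.deps.get(node, set()) | en.deps.get(node, set()),
        }
    return merged


def solve(merged, retrieve, answer):
    """Solve sub-questions in dependency order, querying in both languages."""
    order = TopologicalSorter({n: v["deps"] for n, v in merged.items()}).static_order()
    answers = {}
    for node in order:
        variants = [q for q in (merged[node]["src"], merged[node]["en"]) if q]
        # Substitute answers of earlier sub-questions into the placeholders.
        filled = []
        for q in variants:
            for dep, ans in answers.items():
                q = q.replace("{" + dep + "}", ans)
            filled.append(q)
        # Bilingual retrieval: pool passages from both variants, without duplicates.
        passages = []
        for q in filled:
            for p in retrieve(q):
                if p not in passages:
                    passages.append(p)
        answers[node] = answer(filled, passages)
    return answers


# Toy stand-ins (assumptions): a keyword retriever and a lookup "reader".
CORPUS = {
    "film x": "Film X was directed by Ana Ruiz.",
    "ana ruiz": "Ana Ruiz nació en Sevilla.",
}
ANSWER_IN_PASSAGE = {
    "Film X was directed by Ana Ruiz.": "Ana Ruiz",
    "Ana Ruiz nació en Sevilla.": "Sevilla",
}


def toy_retrieve(query):
    return [text for key, text in CORPUS.items() if key in query.lower()]


def toy_answer(questions, passages):
    return ANSWER_IN_PASSAGE[passages[0]] if passages else ""


src_graph = SubQuestionGraph(
    {"q1": "¿Quién dirigió Film X?", "q2": "¿Dónde nació {q1}?"},
    {"q2": {"q1"}},
)
en_graph = SubQuestionGraph(
    {"q1": "Who directed Film X?", "q2": "Where was {q1} born?"},
    {"q2": {"q1"}},
)

merged = merge_graphs(src_graph, en_graph)
result = solve(merged, toy_retrieve, toy_answer)
print(result)
```

In this toy setup the second sub-question is only answerable after the first is resolved, and the supporting passage for it is in Spanish while the first is in English, which is the kind of mixed-language evidence a bilingual retrieval step is meant to cover.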