The paper introduces MTRAG-UN, a benchmark for evaluating multi-turn retrieval-augmented generation (RAG) systems on challenging conversational scenarios. It comprises 666 tasks with over 2,800 conversation turns across 6 domains, with accompanying corpora, and focuses on UNanswerable, UNderspecified, and NONstandalone questions and UNclear responses. Experiments on the benchmark show that current retrieval and generation models continue to struggle with these kinds of conversations.
Multi-turn RAG models still stumble on conversations with unanswerable questions and unclear responses, as shown by a new benchmark.
We present MTRAG-UN, a benchmark for exploring open challenges in multi-turn retrieval-augmented generation, a popular use of large language models. We release a benchmark of 666 tasks containing over 2,800 conversation turns across 6 domains with accompanying corpora. Our experiments show that retrieval and generation models continue to struggle on conversations with UNanswerable, UNderspecified, and NONstandalone questions and UNclear responses. Our benchmark is available at https://github.com/IBM/mt-rag-benchmark