The paper introduces DoGMaTiQ, a three-stage pipeline for automatically generating high-quality question-answer (QA) based nuggets for evaluating long-form, citation-backed reports, particularly in cross-lingual settings. DoGMaTiQ leverages document-grounded nugget generation, paraphrase clustering, and quality-based nugget subselection. Experiments on the NeuCLIR and RAGTIME datasets demonstrate that DoGMaTiQ-generated nuggets, when integrated with the AutoArgue framework, achieve strong rank correlations with human judgments in evaluating generated reports.
Stop hand-curating nugget sets for evaluating RAG-generated reports: DoGMaTiQ automates the process and yields system rankings that correlate strongly with human judgments, even in cross-lingual settings.
Evaluation of long-form, citation-backed reports has lately received significant attention due to the wide-scale adoption of retrieval-augmented generation (RAG) systems. Core to many evaluation frameworks is the use of atomic facts, or nuggets, to assess a report's coverage of query-relevant information attested in the underlying collection. While nuggets have traditionally been represented as short statements, recent work has used question-answer (QA) representations, enabling fine-grained evaluations that decouple the information need (i.e. the question) from the potentially diverse content that satisfies it (i.e. its answers). A persistent challenge for nugget-based evaluation is the need to manually curate sets of nuggets for each topic in a test collection -- a laborious process that scales poorly to novel information needs. This challenge is acute in cross-lingual settings, where information is found in multilingual source documents. Accordingly, we introduce DoGMaTiQ, a pipeline for generating high-quality QA-based nugget sets in three stages: (1) document-grounded nugget generation, (2) paraphrase clustering, and (3) nugget subselection based on principled quality criteria. We integrate DoGMaTiQ nuggets with AutoArgue -- a recent nugget-based evaluation framework -- to enable fully automatic evaluation of generated reports. We conduct extensive experiments on two cross-lingual TREC shared tasks, NeuCLIR and RAGTIME, showing strong rank correlations with both human-in-the-loop and fully manual judgments. Finally, detailed analysis of our pipeline reveals that a strong LLM nugget generator is key, and that the system rankings induced by DoGMaTiQ are robust to outlier systems. We facilitate future research in report evaluation by publicly releasing our code and artifacts at https://github.com/manestay/dogmatiq.
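To make "rank correlation" concrete, here is a minimal sketch of how agreement between an automatic metric and human judgments over a set of systems might be measured. It assumes Kendall's tau (the tau-a variant); the abstract does not say which coefficient the authors use. The system names and scores are hypothetical and are not results from the paper.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two {system: score} mappings.

    Only systems present in both mappings are compared. Tied pairs count
    as neither concordant nor discordant.
    """
    systems = sorted(set(scores_a) & set(scores_b))
    if len(systems) < 2:
        raise ValueError("need at least two shared systems")

    concordant = discordant = 0
    for s, t in combinations(systems, 2):
        # Positive product: both metrics order the pair the same way.
        product = (scores_a[s] - scores_a[t]) * (scores_b[s] - scores_b[t])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    n_pairs = len(systems) * (len(systems) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-system scores (illustrative only, not from the paper).
auto_scores = {"sys_a": 0.62, "sys_b": 0.48, "sys_c": 0.55, "sys_d": 0.30}
human_scores = {"sys_a": 0.70, "sys_b": 0.52, "sys_c": 0.50, "sys_d": 0.25}

tau = kendall_tau(auto_scores, human_scores)
print(f"Kendall's tau: {tau:.3f}")
```

In this toy example the two score sets disagree on only one of the six system pairs (`sys_b` vs. `sys_c`), so tau is (5 - 1) / 6. A value near 1 means the automatic metric ranks systems much as human assessors do.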