UPenn, Yale | May 6, 2026 | arXiv:2605.04458

DoGMaTiQ: Automated Generation of Question-and-Answer Nuggets for Report Evaluation

Bryan Li, W. Walden, Gabrielle Kaili-May Liu, Dawn J. Lawrie, James Mayfield, Eugene Yang, Chris Callison-Burch, Laura Dietz

AI Summary

The paper introduces DoGMaTiQ, a three-stage pipeline for automatically generating high-quality question-answer (QA) based nuggets for evaluating long-form, citation-backed reports, particularly in cross-lingual settings. DoGMaTiQ leverages document-grounded nugget generation, paraphrase clustering, and quality-based nugget subselection. Experiments on the NeuCLIR and RAGTIME datasets demonstrate that DoGMaTiQ-generated nuggets, when integrated with the AutoArgue framework, achieve strong rank correlations with human judgments in evaluating generated reports.

Key Contribution

DoGMaTiQ removes the need to hand-curate QA-based nugget sets for evaluating RAG-generated reports: its automatically generated nuggets yield system rankings that correlate strongly with human judgments, including in cross-lingual settings.

Abstract

Evaluation of long-form, citation-backed reports has lately received significant attention due to the wide-scale adoption of retrieval-augmented generation (RAG) systems. Core to many evaluation frameworks is the use of atomic facts, or nuggets, to assess a report's coverage of query-relevant information attested in the underlying collection. While nuggets have traditionally been represented as short statements, recent work has used question-answer (QA) representations, enabling fine-grained evaluations that decouple the information need (i.e. the question) from the potentially diverse content that satisfies it (i.e. its answers). A persistent challenge for nugget-based evaluation is the need to manually curate sets of nuggets for each topic in a test collection -- a laborious process that scales poorly to novel information needs. This challenge is acute in cross-lingual settings, where information is found in multilingual source documents. Accordingly, we introduce DoGMaTiQ, a pipeline for generating high-quality QA-based nugget sets in three stages: (1) document-grounded nugget generation, (2) paraphrase clustering, and (3) nugget subselection based on principled quality criteria. We integrate DoGMaTiQ nuggets with AutoArgue -- a recent nugget-based evaluation framework -- to enable fully automatic evaluation of generated reports. We conduct extensive experiments on two cross-lingual TREC shared tasks, NeuCLIR and RAGTIME, showing strong rank correlations with both human-in-the-loop and fully manual judgments. Finally, detailed analysis of our pipeline reveals that a strong LLM nugget generator is key, and that the system rankings induced by DoGMaTiQ are robust to outlier systems. We facilitate future research in report evaluation by publicly releasing our code and artifacts at https://github.com/manestay/dogmatiq.
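To make the QA-nugget idea and the rank-correlation check concrete, here is a minimal Python sketch. It is not the DoGMaTiQ or AutoArgue implementation: the `QANugget` class, the substring-based `nugget_recall` judge, the toy topic, and the human scores are all illustrative assumptions, and Kendall's tau is used only as one common rank-correlation measure, since the abstract does not name the coefficient.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class QANugget:
    """Hypothetical container: one question plus every answer that satisfies it."""
    question: str
    answers: tuple[str, ...]


def nugget_recall(report: str, nuggets: list[QANugget]) -> float:
    """Fraction of nuggets with at least one answer present in the report.

    Case-insensitive substring matching is a toy stand-in for a real judge.
    """
    text = report.lower()
    hits = sum(any(a.lower() in text for a in n.answers) for n in nuggets)
    return hits / len(nuggets) if nuggets else 0.0


def kendall_tau(xs: list[float], ys: list[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two items")
    pairs = list(combinations(range(len(xs)), 2))
    score = 0
    for i, j in pairs:
        product = (xs[i] - xs[j]) * (ys[i] - ys[j])
        score += (product > 0) - (product < 0)  # +1 concordant, -1 discordant
    return score / len(pairs)


# Toy topic (assumed data): each nugget separates the question from its answers.
nuggets = [
    QANugget("Where is the Eiffel Tower?", ("Paris",)),
    QANugget("When was it completed?", ("1889",)),
    QANugget("Who designed it?", ("Gustave Eiffel", "Eiffel's company")),
]

# Reports from three hypothetical systems.
reports = {
    "sys_a": "The Eiffel Tower in Paris was completed in 1889 by Gustave Eiffel.",
    "sys_b": "The tower stands in Paris and opened in 1889.",
    "sys_c": "It is a famous landmark.",
}

auto_scores = [nugget_recall(text, nuggets) for text in reports.values()]
human_scores = [0.9, 0.6, 0.1]  # made-up human judgments, same system order
tau = kendall_tau(auto_scores, human_scores)
```

In this toy case the automatic scores order the three systems exactly as the made-up human scores do, so `tau` is 1.0. Tau-a ignores ties, so a tie-aware variant would be the safer choice when many systems share a score.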

Eval Frameworks & Benchmarks | Natural Language Processing | Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 41

Year: 2026

Venue: N/A
