This paper investigates the impact of enriching Meaning Representations (MRs) with task demonstrators (MR-sentence pairs) on the performance of fine-tuned language generation models in dialogue tasks. The study evaluates the enriched inputs across four datasets with varying characteristics, using five metrics to assess different linguistic aspects of the generated text. The key finding is that enriched MRs improve generation quality, particularly for complex tasks, small datasets with high MR variability, and zero-shot settings, with semantic metrics proving more effective than lexical metrics for evaluation.
Enriching meaning representations with task demonstrators can significantly boost dialogue generation, especially in challenging scenarios, revealing a simple yet effective strategy for improving NLG performance.
Conversational systems should generate diverse language forms to interact fluently and accurately with users. In this context, Natural Language Generation (NLG) engines convert Meaning Representations (MRs) into sentences, directly influencing user perception. These MRs usually encode the communicative function (e.g., inform, request, confirm) via dialogue acts (DAs) and enumerate the semantic content with slot-value pairs. In this work, our objective is to analyse whether providing a task demonstrator to the generator improves the output of a fine-tuned model. This demonstrator is an MR-sentence pair extracted from the original dataset that enriches the input at training and inference time. The analysis involves five metrics that focus on different linguistic aspects, and four datasets that differ in multiple features, such as domain, size, lexicon, MR variability, and acquisition process. To the best of our knowledge, this is the first study on dialogue NLG implementing a comparative analysis of the impact of MRs on generation quality across domains, corpus characteristics, and the metrics used to evaluate these generations. Our key insight is that the proposed enriched inputs are effective for complex tasks and small datasets with high variability in MRs and sentences. They are also beneficial in zero-shot settings for any domain. Moreover, the analysis of the metrics shows that semantic metrics capture generation quality more accurately than lexical metrics. In addition, among these semantic metrics, those trained with human ratings can detect omissions and other subtle semantic issues that embedding-based metrics often miss. Finally, the evolution of the metric scores and the excellent results for Slot Accuracy and Dialogue Act Accuracy demonstrate that the generative models adapt quickly to different tasks and are robust at the semantic and communicative-intention levels.
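As a rough illustration of the enriched input described above, the sketch below linearizes an MR (a dialogue act plus slot-value pairs) and prepends one demonstrator MR-sentence pair to it. The function names, the `inform(slot=value)` linearization, the field labels, and the `|` separator are all assumptions made for this example; the abstract does not specify the exact input format used in the paper.

```python
from typing import Dict, Tuple


def linearize_mr(dialogue_act: str, slots: Dict[str, str]) -> str:
    """Flatten an MR into a string such as inform(name=Aromi, food=Italian).

    Hypothetical format: the paper's actual linearization is not given.
    """
    pairs = ", ".join(f"{slot}={value}" for slot, value in slots.items())
    return f"{dialogue_act}({pairs})"


def build_enriched_input(target_mr: str, demonstrator: Tuple[str, str]) -> str:
    """Prepend a demonstrator (MR, sentence) pair to the target MR.

    The labels and the ' | ' separator are illustrative assumptions.
    """
    demo_mr, demo_sentence = demonstrator
    return (
        f"example MR: {demo_mr} | "
        f"example sentence: {demo_sentence} | "
        f"MR: {target_mr}"
    )


# Demonstrator drawn from the same dataset as the target MR (made-up data).
demo = (
    linearize_mr("inform", {"name": "Aromi", "food": "Italian"}),
    "Aromi serves Italian food.",
)
target = linearize_mr("inform", {"name": "Zizzi", "area": "riverside"})
enriched = build_enriched_input(target, demo)
```

The same enriched string would be fed to the model at both training and inference time, so that the generator always sees one worked MR-sentence example alongside the MR it has to verbalize.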