The paper introduces CommonSyn, a two-stage synthetic dataset generation method to address the lack of large-scale, high-quality, diverse commonsense training data for conversational agents. The method leverages LLMs to generate diverse commonsense scenarios, which are then filtered and refined. Fine-tuning LLMs on CommonSyn improves both the diversity and quality of generated responses compared to models trained on human-annotated datasets.
Training on synthetically generated data can significantly boost both the diversity and quality of commonsense reasoning in LLMs, outperforming models trained on scarce human-annotated data.
Conversational agents are required to respond to their users not only with high-quality (i.e., commonsense-bearing) responses, but also by considering multiple plausible alternative scenarios, reflecting diversity in their responses. Despite the growing need to train diverse commonsense generators, progress on this line of work has been significantly hindered by the lack of large-scale, high-quality, diverse commonsense training datasets. Due to high annotation costs, existing Generative Commonsense Reasoning (GCR) datasets are created by a small number of human annotators and cover only a narrow set of commonsense scenarios. To address this training resource gap, we propose a two-stage method to create CommonSyn, the first synthetic dataset for diversified GCR. Models fine-tuned on our synthetic data jointly improve both generation diversity and quality compared with vanilla models and models fine-tuned on a human-crafted dataset, across Large Language Models (LLMs) of different sizes.
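The abstract describes a generate-then-filter pipeline but does not specify its implementation. Below is a minimal, hedged sketch of what such a two-stage process could look like: stage 1 produces candidate commonsense scenarios (stubbed here with a placeholder function rather than a real LLM call), and stage 2 filters near-duplicates and low-quality candidates with simple heuristics. All function names, prompts, and filtering criteria are illustrative assumptions, not the paper's actual method.

```python
# Illustrative sketch of a two-stage synthetic-data pipeline in the spirit
# of CommonSyn. generate_candidates is a stand-in stub; in practice this
# stage would prompt an LLM for diverse commonsense scenarios.

def generate_candidates(seed_topics, n_per_topic=3):
    """Stage 1: produce candidate commonsense scenarios per seed topic.
    A real implementation would sample completions from an LLM."""
    candidates = []
    for topic in seed_topics:
        for i in range(n_per_topic):
            candidates.append(f"Scenario {i} about {topic}.")
    return candidates

def filter_and_refine(candidates, min_len=10):
    """Stage 2: drop exact near-duplicates and overly short candidates.
    The paper's actual filtering criteria are not specified; model-based
    quality/diversity scoring would replace these heuristics."""
    seen, kept = set(), []
    for c in candidates:
        key = c.lower().strip()
        if len(c) >= min_len and key not in seen:
            seen.add(key)
            kept.append(c)
    return kept

topics = ["borrowing a ladder", "missing a bus"]
dataset = filter_and_refine(generate_candidates(topics))
```

The resulting `dataset` would then serve as fine-tuning data; the deduplication step preserves diversity while the length (or, in a real system, quality-scoring) threshold guards response quality.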