This paper addresses the challenge of cross-domain slot filling in spoken language understanding by using LLMs to generate synthetic data for data-scarce target domains. The authors introduce a two-stage data generation strategy in which LLMs synthesize samples, together with a data curation mechanism based on confidence and uncertainty that filters out low-quality samples. Experiments demonstrate the effectiveness and generality of the proposed approach in enhancing cross-domain slot filling performance.
LLMs can close the data gap in cross-domain slot filling, but only if their synthetic data is carefully curated using confidence and uncertainty metrics.
In real-world scenarios, cross-domain slot filling in spoken language understanding remains a significant challenge due to data scarcity. Previous works focus on supplementing sequence labeling models with slot meta-information or metric learning, but they generalize poorly because they lack domain-specific knowledge. To enhance generalization, recent studies introduce implicit general knowledge, via further pretraining or larger-parameter generative models, to improve performance on slots that lack domain-specific knowledge. However, this knowledge is domain-agnostic and cannot provide comprehensive knowledge for the target domain. Therefore, we propose a two-stage data generation strategy that uses powerful LLMs to synthesize samples, introducing knowledge for each slot of the data-scarce target domain. More importantly, we employ a data curation mechanism based on confidence and uncertainty to identify and filter out low-quality samples, yielding a high-quality synthetic dataset. Extensive experimental results demonstrate the effectiveness and generality of our approach.
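The curation idea can be sketched concretely. The abstract does not give the paper's exact scoring formulas, so the sketch below makes illustrative assumptions: confidence is taken as the mean probability a slot tagger assigns to its argmax label per token, uncertainty as the mean Shannon entropy of the per-token label distributions, and both thresholds (`conf_min`, `unc_max`) are hypothetical values, not the authors'.

```python
import math

def confidence(token_probs):
    """Mean probability of the argmax label across a sample's tokens."""
    return sum(max(p) for p in token_probs) / len(token_probs)

def uncertainty(token_probs):
    """Mean Shannon entropy of the per-token label distributions."""
    def entropy(p):
        return -sum(q * math.log(q) for q in p if q > 0)
    return sum(entropy(p) for p in token_probs) / len(token_probs)

def curate(samples, conf_min=0.8, unc_max=0.5):
    """Keep synthetic samples the tagger labels confidently and consistently.

    samples: list of (text, token_probs) pairs, where token_probs is a
    list of per-token label probability distributions.
    """
    return [
        text
        for text, token_probs in samples
        if confidence(token_probs) >= conf_min
        and uncertainty(token_probs) <= unc_max
    ]

# Example: a confident sample is kept, a low-confidence one is filtered out.
samples = [
    ("book a flight to Paris", [[0.95, 0.05], [0.90, 0.10]]),
    ("noisy synthetic output", [[0.55, 0.45], [0.60, 0.40]]),
]
kept = curate(samples)
```

Here the first sample passes (confidence 0.925, mean entropy about 0.26 nats) while the second fails the confidence threshold, so only the first text survives curation.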