Apr 9, 2026 | arXiv:2604.08381

A GAN and LLM-Driven Data Augmentation Framework for Dynamic Linguistic Pattern Modeling in Chinese Sarcasm Detection

Wenxian Wang, Xiaohu Luo, Junfeng Hao, Xiaoming Gu, Xingshu Chen, Zhuo Wang, Zhu Wang, Haizhou Wang

AI Summary

This paper introduces SinaSarc, a new Chinese sarcasm detection dataset created by augmenting real-world Sina Weibo data using a GAN and GPT-3.5 to generate synthetic sarcastic comments. The authors then extend the BERT architecture to incorporate user historical behavior, allowing the model to capture dynamic linguistic patterns indicative of sarcasm. Experiments show that the proposed model achieves state-of-the-art F1-scores on both sarcastic and non-sarcastic categories, demonstrating the effectiveness of incorporating user-specific linguistic patterns.

Key Contribution

Injecting user history into a BERT model trained on GAN and LLM-augmented data unlocks state-of-the-art Chinese sarcasm detection.

Abstract

Sarcasm is a rhetorical device that expresses criticism or emphasizes characteristics of certain individuals or situations through exaggeration, irony, or comparison. Existing methods for Chinese sarcasm detection are constrained by limited datasets and high construction costs, and they mainly focus on textual features, overlooking user-specific linguistic patterns that shape how opinions and emotions are expressed. This paper proposes a Generative Adversarial Network (GAN) and Large Language Model (LLM)-driven data augmentation framework to dynamically model users' linguistic patterns for enhanced Chinese sarcasm detection. First, we collect raw data from various topics on Sina Weibo. Then, we train a GAN on these data and apply a GPT-3.5-based data augmentation technique to synthesize an extended sarcastic comment dataset, named SinaSarc. This dataset contains target comments, contextual information, and user historical behavior. Finally, we extend the BERT architecture to incorporate multi-dimensional information, particularly user historical behavior, enabling the model to capture dynamic linguistic patterns and uncover implicit sarcastic cues in comments. Experimental results demonstrate the effectiveness of our proposed method. Specifically, our model achieves the highest F1-scores on both the non-sarcastic and sarcastic categories, with values of 0.9138 and 0.9151 respectively, outperforming all existing state-of-the-art (SOTA) approaches. This study presents a novel framework for dynamically modeling users' long-term linguistic patterns in Chinese sarcasm detection, contributing to both dataset construction and methodological advancement in this field.
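
The abstract reports a separate F1-score for the non-sarcastic and the sarcastic category. As a reading aid, the following is a minimal sketch of how a per-class F1 is conventionally computed for a binary task. It is not the authors' evaluation code: the function name `per_class_f1`, the label encoding (1 = sarcastic, 0 = non-sarcastic), and the toy labels are all assumptions made for illustration.

```python
from typing import Dict, List


def per_class_f1(y_true: List[int], y_pred: List[int], label: int) -> float:
    """Return F1 for one class, treating `label` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    # F1 is the harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


# Toy labels invented for illustration (assumed encoding: 1 = sarcastic).
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]

scores: Dict[str, float] = {
    "non_sarcastic": per_class_f1(y_true, y_pred, label=0),
    "sarcastic": per_class_f1(y_true, y_pred, label=1),
}
print(scores)
```

Reporting the score for each class separately, as the abstract does, shows whether a model performs comparably on both categories instead of favoring one of them.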

Data Curation & Synthetic Data, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 62

Year: 2026

Venue: N/A
