Search papers, labs, and topics across Lattice.
The paper introduces a Sentiment-Aware Hierarchical Fusion Network (SAHFN) for multimodal sarcasm detection, addressing the limitations of unimodal and existing multimodal approaches in capturing subtle contradictions between text and images. SAHFN employs hierarchical fusion and crossmodal transformers to model inter-modal dependencies, and incorporates a contrastive learning mechanism to improve sarcasm classification. Experimental results demonstrate that SAHFN achieves an accuracy of 88.9% on a multimodal sarcasm detection task, outperforming baseline models by effectively aligning sarcastic text with visual cues.
Achieve near-SOTA multimodal sarcasm detection by explicitly aligning text and image sentiment via a hierarchical fusion network.
Detecting sarcasm is a very difficult task in both natural language processing (NLP) as well as computer vision because it involves understanding contradictions in both text and images. Traditional text approaches have difficulty with implicit sarcasm; most image-based methods do general semantic reasoning but do not align contextual information with linguistic cues. To tackle this problem, we propose a Sentiment-Aware Hierarchical Fusion Network (SAHFN) that fuses the information of text, images, and the sentiment-aware embeddings of image and text together, so as to enhance the performance of sarcasm detection. It utilizes hierarchical fusion and crossmodal transformers to model inter-modal dependencies, together with a contrastive learning mechanism to enhance sarcasm classification. Experimental results show that SAHFN exceeds the performance of the baseline models by aligning the sarcastic text with its visual cues-achieving an accuracy of 88.9% on the sarcasm detection task. Train and Validate graphs showing progressive optimization confirm the model's robustness in sarcasm recognition. Metadata identification and detection is a practical application of this technique often used in social media analysis, sentiment detection, and automated moderation systems. The sarcasm detection model can be improved in the future as used more extensive datasets having multimodal data tracked over time.