G H Raisoni College of Engineering; Pandit Sundarlal Sharma (Open) University Chattishgarh; Symbiosis Institute of Technology

Aug 1, 2025

Multimodal Sarcasm Analysis: Leveraging Hierarchical Fusion and Sentiment Alignment

Parul Dubey, R. Pradhan, Nitin Rakesh, Sumit Prasad, Pushpa M. Chutel, Pranali Dhawas

AI Summary

The paper introduces a Sentiment-Aware Hierarchical Fusion Network (SAHFN) for multimodal sarcasm detection, addressing the limitations of unimodal and existing multimodal approaches in capturing subtle contradictions between text and images. SAHFN employs hierarchical fusion and crossmodal transformers to model inter-modal dependencies, and incorporates a contrastive learning mechanism to improve sarcasm classification. Experimental results demonstrate that SAHFN achieves an accuracy of 88.9% on a multimodal sarcasm detection task, outperforming baseline models by effectively aligning sarcastic text with visual cues.

Key Contribution

Achieve near-SOTA multimodal sarcasm detection by explicitly aligning text and image sentiment via a hierarchical fusion network.

Abstract

Sarcasm detection is difficult for both natural language processing (NLP) and computer vision because it requires understanding contradictions in both text and images. Traditional text-based approaches struggle with implicit sarcasm, while most image-based methods perform general semantic reasoning but do not align contextual information with linguistic cues. To address this problem, we propose a Sentiment-Aware Hierarchical Fusion Network (SAHFN) that fuses text, images, and sentiment-aware embeddings of both modalities to improve sarcasm detection. SAHFN uses hierarchical fusion and crossmodal transformers to model inter-modal dependencies, together with a contrastive learning mechanism to enhance sarcasm classification. Experimental results show that SAHFN outperforms the baseline models by aligning sarcastic text with its visual cues, achieving an accuracy of 88.9% on the sarcasm detection task. Training and validation curves showing progressive optimization support the model's robustness in sarcasm recognition. Metadata identification and detection is a practical application of this technique, often used in social media analysis, sentiment detection, and automated moderation systems. In future work, the sarcasm detection model could be improved by using more extensive datasets containing multimodal data tracked over time.
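The abstract names two mechanisms, crossmodal attention and contrastive learning, without giving their exact form. The following is a minimal NumPy sketch of what such components commonly look like, not the authors' implementation. Everything in it is an assumption for illustration: the function names, the absence of learned projection matrices, the symmetric InfoNCE-style loss, and the temperature of 0.1.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def cross_modal_attention(text_tokens, image_regions):
    """Scaled dot-product attention with text as queries, image as keys/values.

    text_tokens:   (n_text, d) array of token features
    image_regions: (n_regions, d) array of region features
    Returns the attended features (n_text, d) and the weights (n_text, n_regions).
    Learned Q/K/V projections are omitted here to keep the sketch short.
    """
    d = text_tokens.shape[-1]
    scores = text_tokens @ image_regions.T / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return weights @ image_regions, weights


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def contrastive_alignment_loss(text_emb, image_emb, temperature=0.1):
    """Symmetric InfoNCE-style loss over a batch of paired embeddings.

    Row i of text_emb is assumed to belong with row i of image_emb;
    all other rows in the batch act as negatives.
    """
    t = l2_normalize(text_emb)
    v = l2_normalize(image_emb)
    logits = t @ v.T / temperature            # (batch, batch) similarity matrix
    text_to_image = -np.mean(np.diag(log_softmax(logits, axis=1)))
    image_to_text = -np.mean(np.diag(log_softmax(logits, axis=0)))
    return 0.5 * (text_to_image + image_to_text)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    text = rng.normal(size=(5, 16))      # 5 hypothetical token vectors
    regions = rng.normal(size=(7, 16))   # 7 hypothetical image-region vectors
    attended, weights = cross_modal_attention(text, regions)
    print(attended.shape, weights.shape)
    print(contrastive_alignment_loss(text, text + 0.01 * rng.normal(size=text.shape)))
```

In this sketch, each text token receives a weighted summary of the image regions, and the loss is lower when matched text and image embeddings are more similar to each other than to other pairs in the batch. How SAHFN combines these pieces hierarchically and with sentiment-aware embeddings is not specified in the abstract.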

Computer Vision, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 16

Year: 2025

Venue: 2025 12th International Conference on Emerging Trends in Engineering & Technology - Signal and Information Processing (ICETET - SIP)
