The paper introduces VBCSNet, a novel hybrid attention-based multimodal framework for sentiment analysis that integrates Vision Transformer (ViT), BERT, and CLIP. VBCSNet uses Structured Self-Attention (SSA) within modalities and Cross-Attention across modalities to improve feature representation and semantic alignment. The model is trained with a multi-objective loss function that jointly minimizes classification, modality-alignment, and contrastive losses, achieving state-of-the-art results on three multilingual datasets (MVSA, IJCAI2019, JP-Buzz).
VBCSNet's hybrid attention architecture and structured self-attention deliver state-of-the-art multimodal sentiment analysis with improved interpretability across languages.
Multimodal Sentiment Analysis (MSA), a pivotal task in affective computing, aims to enhance sentiment understanding by integrating heterogeneous data from modalities such as text, images, and audio. However, existing methods continue to face challenges in semantic alignment, modality fusion, and interpretability. To address these limitations, we propose VBCSNet, a hybrid attention-based multimodal framework that leverages the complementary strengths of Vision Transformer (ViT), BERT, and CLIP architectures. VBCSNet employs a Structured Self-Attention (SSA) mechanism to optimize intra-modal feature representation and a Cross-Attention module to achieve fine-grained semantic alignment across modalities. Furthermore, we introduce a multi-objective optimization strategy that jointly minimizes classification loss, modality alignment loss, and contrastive loss, thereby enhancing semantic consistency and feature discriminability. We evaluated VBCSNet on three multilingual multimodal sentiment datasets: MVSA, IJCAI2019, and a self-constructed Japanese Twitter corpus (JP-Buzz). Experimental results demonstrated that VBCSNet significantly outperformed state-of-the-art baselines in terms of Accuracy, Macro-F1, and cross-lingual generalization. Per-class performance analysis further highlighted the model’s interpretability and robustness. Overall, VBCSNet advances sentiment classification across languages and domains while offering a transparent reasoning mechanism suitable for real-world applications in affective computing, human-computer interaction, and socially aware AI systems.
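The abstract does not give the exact formulation of the three-term objective, so the following is a minimal NumPy sketch of one common instantiation: cross-entropy for classification, a mean-squared alignment term between paired text and image embeddings, and an InfoNCE-style contrastive term. The loss weights (`lam_align`, `lam_con`), the MSE form of the alignment loss, and the InfoNCE temperature are assumptions, not details taken from the paper.

```python
import numpy as np

def cross_entropy(logits, labels):
    # Numerically stable softmax cross-entropy, averaged over the batch.
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()

def alignment_loss(text_emb, image_emb):
    # Assumed MSE between paired modality embeddings (pulls the two
    # modality representations of the same post together).
    return np.mean((text_emb - image_emb) ** 2)

def contrastive_loss(text_emb, image_emb, temperature=0.07):
    # Assumed InfoNCE over the batch: the matching text/image pair is the
    # positive, all other pairs in the batch are negatives (CLIP-style).
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature
    labels = np.arange(len(t))  # diagonal entries are the positives
    return cross_entropy(logits, labels)

def total_loss(logits, labels, text_emb, image_emb,
               lam_align=0.5, lam_con=0.5):
    # Weighted sum of the three objectives; lam_* are hypothetical weights.
    return (cross_entropy(logits, labels)
            + lam_align * alignment_loss(text_emb, image_emb)
            + lam_con * contrastive_loss(text_emb, image_emb))
```

In practice each term would be computed on the outputs of the ViT/BERT encoders and minimized jointly by the same optimizer step; the sketch only shows how the scalar objective is assembled.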