Mar 12, 2026 | arXiv:2603.11971

Multimodal Emotion Recognition via Bi-directional Cross-Attention and Temporal Modeling

Junhyeong Byeon, Jeongyeol Kim, Sejoon Lim

AI Summary

This paper introduces a multimodal emotion recognition framework for video data using pre-trained CLIP and Wav2Vec 2.0 models for visual and audio encoding, respectively. The framework employs a Temporal Convolutional Network (TCN) to model temporal dependencies in facial expressions and a bi-directional cross-attention module for cross-modal fusion. Experimental results on the ABAW 10th EXPR benchmark demonstrate that the proposed framework achieves improved performance over unimodal modeling by effectively combining temporal visual modeling, audio representation learning, and cross-modal fusion.

Key Contribution

Fuses visual and audio cues with a bi-directional cross-attention mechanism, giving a strong multimodal baseline that outperforms unimodal modeling on the ABAW 10th EXPR benchmark.

Abstract

Emotion recognition in in-the-wild video data remains a challenging problem due to large variations in facial appearance, head pose, illumination, background noise, and the inherently dynamic nature of human affect. Relying on a single modality, such as facial expressions or speech, is often insufficient to capture these complex emotional cues. To address this issue, we propose a multimodal emotion recognition framework for the Expression (EXPR) Recognition task in the 10th Affective Behavior Analysis in-the-wild (ABAW) Challenge. Our approach leverages large-scale pre-trained models, namely CLIP for visual encoding and Wav2Vec 2.0 for audio representation learning, as frozen backbone networks. To model temporal dependencies in facial expression sequences, we employ a Temporal Convolutional Network (TCN) over fixed-length video windows. In addition, we introduce a bi-directional cross-attention fusion module, in which visual and audio features interact symmetrically to enhance cross-modal contextualization and capture complementary emotional information. A lightweight classification head is then used for final emotion prediction. We further incorporate a text-guided contrastive objective based on CLIP text features to encourage semantically aligned visual representations. Experimental results on the ABAW 10th EXPR benchmark show that the proposed framework provides a strong multimodal baseline and achieves improved performance over unimodal modeling. These results demonstrate the effectiveness of combining temporal visual modeling, audio representation learning, and cross-modal fusion for robust emotion recognition in unconstrained real-world environments.
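The symmetric fusion step described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names (`cross_attend`, `bidirectional_fusion`, `mean_pool`), the residual connection, the mean pooling, and the feature sizes are all assumptions made for illustration, and random vectors stand in for the frozen CLIP and Wav2Vec 2.0 features. Learned query/key/value projections and multiple heads, which a trained module would normally have, are left out for brevity.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Scaled dot-product attention: each query row attends over all context rows.

    Assumption: no learned Q/K/V projections and a single head.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for q in queries:
        weights = softmax(
            [scale * sum(a * b for a, b in zip(q, k)) for k in context]
        )
        attended.append(
            [sum(w * row[c] for w, row in zip(weights, context)) for c in range(dim)]
        )
    return attended


def mean_pool(sequence):
    """Average a sequence of feature vectors over the time axis."""
    return [sum(column) / len(sequence) for column in zip(*sequence)]


def bidirectional_fusion(visual, audio):
    """Fuse two sequences: visual attends to audio, and audio attends to visual."""
    v_from_a = cross_attend(visual, audio)  # visual queries, audio keys/values
    a_from_v = cross_attend(audio, visual)  # audio queries, visual keys/values
    # Residual connection (an assumption of this sketch) keeps the original features.
    v_ctx = [[x + y for x, y in zip(r, s)] for r, s in zip(visual, v_from_a)]
    a_ctx = [[x + y for x, y in zip(r, s)] for r, s in zip(audio, a_from_v)]
    # Concatenate the pooled streams into one vector for a classification head.
    return mean_pool(v_ctx) + mean_pool(a_ctx)


random.seed(0)
DIM = 8  # hypothetical shared feature size after projection
# Random stand-ins for per-frame visual features and per-step audio features.
visual_feats = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(16)]
audio_feats = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(10)]

fused = bidirectional_fusion(visual_feats, audio_feats)
print(len(fused))  # 16: two pooled streams of DIM features each
```

The two attention calls use the same function with the roles swapped, which is what makes the interaction symmetric. The two sequences may have different lengths, since each query row attends over whatever context it is given.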

Computer Vision, Multimodal Models, Speech & Audio

Citation Metrics

Citations: 0
Influential citations: 0
References: 37
Year: 2026
Venue: N/A
