Mar 12, 2026, arXiv:2603.12221

A Two-Stage Dual-Modality Model for Facial Emotional Expression Recognition

AI Summary

A two-stage dual-modality (audio-visual) model is proposed for facial expression recognition in unconstrained videos, leveraging DINOv2 visual features and Wav2Vec 2.0 audio features. The first stage employs a DINOv2 ViT-L/14 backbone with padding-aware augmentation and a mixture-of-experts head for robust visual feature extraction. The second stage fuses multi-scale visual features with audio features using a gated fusion module and applies temporal smoothing, achieving a Macro-F1 score of 0.5368 on the ABAW validation set.
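For readers unfamiliar with the metric, Macro-F1 is the unweighted mean of the per-class F1 scores, so rare expressions count as much as frequent ones. The sketch below is a minimal pure-Python illustration, not the official ABAW evaluation code; the function name `macro_f1` and the choice to score a class with no true or predicted samples as 0 are assumptions made here.

```python
def macro_f1(y_true, y_pred, num_classes=8):
    """Unweighted mean of per-class F1 over `num_classes` labels.

    Assumption: a class with no true and no predicted samples
    contributes an F1 of 0.0 (other conventions skip such classes).
    """
    f1_scores = []
    for c in range(num_classes):
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        # F1 = 2*TP / (2*TP + FP + FN), equivalent to the
        # harmonic mean of precision and recall.
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes
```

With eight expression classes, a model that ignores a rare class entirely loses one eighth of the attainable score regardless of how many frames that class covers.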

Key Contribution

DINOv2 visual features and Wav2Vec 2.0 audio features can be fused in a two-stage model that outperforms the official ABAW baselines on frame-level facial expression recognition in challenging, unconstrained video conditions.

Abstract

This paper addresses the expression (EXPR) recognition challenge in the 10th Affective Behavior Analysis in-the-Wild (ABAW) workshop and competition, which requires frame-level classification of eight facial emotional expressions from unconstrained videos. This task is challenging due to inaccurate face localization, large pose and scale variations, motion blur, temporal instability, and other confounding factors across adjacent frames. We propose a two-stage dual-modal (audio-visual) model to address these difficulties. Stage I focuses on robust visual feature extraction with a pretrained DINOv2-based encoder. Specifically, DINOv2 ViT-L/14 is used as the backbone, a padding-aware augmentation (PadAug) strategy is employed for image padding and data preprocessing from raw videos, and a mixture-of-experts (MoE) training head is introduced to enhance classifier diversity. Stage II addresses modality fusion and temporal consistency. For the visual modality, faces are re-cropped from raw videos at multiple scales, and the extracted visual features are averaged to form a robust frame-level representation. Concurrently, frame-aligned Wav2Vec 2.0 audio features are derived from short audio windows to provide complementary acoustic cues. These dual-modal features are integrated via a lightweight gated fusion module, followed by inference-time temporal smoothing. Experiments on the ABAW dataset demonstrate the effectiveness of the proposed method. The two-stage model achieves a Macro-F1 score of 0.5368 on the official validation set and 0.5122 +/- 0.0277 under 5-fold cross-validation, outperforming the official baselines.
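The abstract does not specify the form of the gated fusion module or the smoothing filter, so the following is only a minimal sketch under stated assumptions: the gate is a per-dimension sigmoid over the concatenated visual and audio features, both modalities are assumed to be already projected to the same dimensionality, and smoothing is a centered moving average over per-frame class probabilities. The names `gated_fusion`, `smooth_probs`, `gate_w`, `gate_b`, and `window` are hypothetical and do not come from the paper.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(visual, audio, gate_w, gate_b):
    """Blend two equal-length feature vectors with a learned gate.

    Assumed form: g = sigmoid(W [v; a] + b), fused = g*v + (1-g)*a.
    gate_w has one row per output dimension, each row of length
    len(visual) + len(audio); gate_b has one bias per dimension.
    """
    concat = visual + audio  # list concatenation: [v; a]
    fused = []
    for d in range(len(visual)):
        logit = sum(w * x for w, x in zip(gate_w[d], concat)) + gate_b[d]
        g = sigmoid(logit)
        fused.append(g * visual[d] + (1.0 - g) * audio[d])
    return fused


def smooth_probs(probs, window=5):
    """Centered moving average over per-frame class probabilities.

    The window is truncated at the start and end of the video.
    """
    half = window // 2
    num_classes = len(probs[0])
    smoothed = []
    for t in range(len(probs)):
        lo, hi = max(0, t - half), min(len(probs), t + half + 1)
        n = hi - lo
        smoothed.append(
            [sum(frame[c] for frame in probs[lo:hi]) / n
             for c in range(num_classes)]
        )
    return smoothed
```

With zero gate weights and biases the gate sits at 0.5 and the fusion reduces to a plain average of the two modalities; training would move the gate toward whichever modality is more reliable per dimension. The smoothing step illustrates why inference-time filtering helps with temporal instability: a single-frame flip in the predicted class is averaged away by its neighbours.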

Computer Vision, Multimodal Models, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 37

Year: 2026

Venue: N/A
