Mar 9, 2026 | arXiv:2603.08034

Solution to the 10th ABAW Expression Recognition Challenge: A Robust Multimodal Framework with Safe Cross-Attention and Modality Dropout

Jun Yu, Naixiang Zheng, Guoyuan Wang, Yunxiang Zhang, Lingsi Zhu, Jiaen Liang, Wei Huang, Shengping Liu

AI Summary

This paper introduces a multimodal emotion recognition framework built on a dual-branch Transformer with safe cross-attention and modality dropout, designed to handle partial occlusions and missing modalities in real-world environments. The framework dynamically fuses visual and audio representations and relies on audio when visual cues are absent. Focal loss addresses class imbalance, and a sliding-window soft voting strategy reduces frame-level classification jitter. The method achieves an accuracy of 60.79% and an F1-score of 0.5029 on the Aff-Wild2 validation set.
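As a rough illustration of the modality dropout idea, the sketch below randomly blanks the visual stream of a training sample so that a model is forced to learn from audio alone some of the time. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `modality_dropout`, the drop probability `p_drop`, and the choice to drop only the visual stream are all hypothetical, since the summary does not specify them.

```python
import random


def modality_dropout(visual_seq, p_drop=0.3, rng=random):
    """Randomly drop the whole visual stream of one training sample.

    visual_seq: list of per-frame feature vectors (lists of floats).
    p_drop:     assumed drop probability (not given in the source).
    rng:        any object with a .random() method, e.g. random.Random(seed).

    Returns (features, valid_mask). When the stream is dropped, the features
    are zeroed and every mask entry is False, mimicking a missing face track.
    """
    if rng.random() < p_drop:
        zeros = [[0.0] * len(frame) for frame in visual_seq]
        return zeros, [False] * len(visual_seq)
    # Keep the stream: copy the frames and mark every one as valid.
    return [list(frame) for frame in visual_seq], [True] * len(visual_seq)
```

The returned mask can then be passed to the fusion step so that dropped frames are excluded from attention rather than treated as real all-zero observations.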

Key Contribution

A dual-branch Transformer with safe cross-attention compensates for missing visual cues in emotion recognition by dynamically relying on audio, reaching an accuracy of 60.79% and an F1-score of 0.5029 on the Aff-Wild2 validation set.
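One plausible reading of "safe" cross-attention is sketched below. The page does not define the mechanism, so this is an assumption: a standard masked softmax is undefined when every visual key is masked out, and the sketch guards against that case by returning a zero context vector, so a residual connection passes the audio feature through unchanged. The names `safe_cross_attention` and `fuse`, the single-query form, and the residual fusion are hypothetical and chosen only for illustration.

```python
import math


def safe_cross_attention(query, keys, values, key_valid):
    """Single-query scaled dot-product attention that tolerates a fully
    masked key set (assumed meaning of "safe"; not confirmed by the source).

    query:     audio feature vector (list of floats).
    keys:      visual key vectors, one per frame.
    values:    visual value vectors, one per frame.
    key_valid: list of bools, False where the visual frame is missing.

    Returns (context, weights). If no key is valid, context is all zeros
    and every weight is 0.0 instead of NaN.
    """
    valid_idx = [i for i, ok in enumerate(key_valid) if ok]
    out_dim = len(values[0]) if values else len(query)
    weights = [0.0] * len(keys)
    if not valid_idx:
        return [0.0] * out_dim, weights

    scale = 1.0 / math.sqrt(len(query))
    scores = {i: scale * sum(q * k for q, k in zip(query, keys[i]))
              for i in valid_idx}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {i: math.exp(s - top) for i, s in scores.items()}
    total = sum(exps.values())
    for i in valid_idx:
        weights[i] = exps[i] / total
    context = [sum(weights[i] * values[i][j] for i in valid_idx)
               for j in range(out_dim)]
    return context, weights


def fuse(audio_feat, visual_keys, visual_values, visual_valid):
    """Residual fusion: audio feature plus attended visual context."""
    context, _ = safe_cross_attention(
        audio_feat, visual_keys, visual_values, visual_valid)
    return [a + c for a, c in zip(audio_feat, context)]
```

With every entry of `visual_valid` set to False, `fuse` returns the audio feature unchanged, which is the fallback behaviour described above.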

Abstract

Emotion recognition in real-world environments is hindered by partial occlusions, missing modalities, and severe class imbalance. To address these issues, particularly for the Affective Behavior Analysis in-the-wild (ABAW) Expression challenge, we propose a multimodal framework that dynamically fuses visual and audio representations. Our approach uses a dual-branch Transformer architecture featuring a safe cross-attention mechanism and a modality dropout strategy. This design allows the network to rely on audio-based predictions when visual cues are absent. To mitigate the long-tail distribution of the Aff-Wild2 dataset, we apply focal loss optimization, combined with a sliding-window soft voting strategy to capture dynamic emotional transitions and reduce frame-level classification jitter. Experiments demonstrate that our framework effectively handles missing modalities and complex spatiotemporal dependencies, achieving an accuracy of 60.79% and an F1-score of 0.5029 on the Aff-Wild2 validation set.
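The two remaining ingredients of the abstract, focal loss and sliding-window soft voting, can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the focal loss follows the standard form FL = -(1 - p_t)^gamma * log(p_t), and the focusing parameter `gamma`, the window length `window`, and the centered, edge-truncated window are assumptions, since the abstract gives none of these settings.

```python
import math


def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one frame: -(1 - p_t)^gamma * log(p_t).

    probs:  predicted class probabilities for the frame.
    target: index of the ground-truth class.
    gamma:  assumed focusing parameter; gamma=0 recovers cross-entropy.
    """
    p_t = max(probs[target], 1e-12)  # clamp to avoid log(0)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def sliding_window_soft_vote(frame_probs, window=5):
    """Smooth frame-level predictions by averaging class probabilities
    over a centered window, then taking the argmax per frame.

    frame_probs: list of per-frame probability vectors.
    window:      assumed window length in frames; truncated at sequence edges.
    """
    half = window // 2
    n = len(frame_probs)
    labels = []
    for t in range(n):
        lo, hi = max(0, t - half), min(n, t + half + 1)
        num_classes = len(frame_probs[t])
        mean = [sum(frame_probs[s][c] for s in range(lo, hi)) / (hi - lo)
                for c in range(num_classes)]
        labels.append(max(range(num_classes), key=mean.__getitem__))
    return labels
```

Averaging probabilities rather than hard labels lets a confident neighbouring frame outweigh a marginal one, which is what suppresses single-frame flips in the predicted expression.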

Computer Vision, Multimodal Models, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
