The paper introduces TAEMI, a novel multimodal framework for estimating Emotional Mimicry Intensity (EMI) that addresses the challenge of noisy and missing data in real-world affective computing. TAEMI uses textual transcripts as stable semantic anchors to filter noisy visual and acoustic signals via a Text-Anchored Dual Cross-Attention mechanism. The model further incorporates Learnable Missing-Modality Tokens and Modality Dropout to enhance robustness against missing data, achieving state-of-the-art performance on the Hume-Vidmimic2 dataset.
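The mechanism summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes single-head attention without learned projections, fixed placeholder vectors in place of learnable missing-modality tokens, and concatenation as the fusion step. The names `cross_attend`, `apply_modality_dropout`, and `text_anchored_fusion` are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention where each query attends over keys_values.

    Assumption: keys and values are the same vectors (no learned projections).
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, kv)) / math.sqrt(dim)
            for kv in keys_values
        ]
        weights = softmax(scores)
        attended.append([
            sum(w * kv[d] for w, kv in zip(weights, keys_values))
            for d in range(dim)
        ])
    return attended


def apply_modality_dropout(stream, p_drop, rng, training):
    """During training, discard a whole modality with probability p_drop."""
    if training and stream is not None and rng.random() < p_drop:
        return None
    return stream


def text_anchored_fusion(text, visual, audio, missing_tokens,
                         p_drop=0.0, rng=None, training=False):
    """Use text tokens as queries over the visual and audio streams.

    A stream that is absent (None) or dropped is replaced by a single
    placeholder token; in a trained model this token would be learnable.
    Returns one fused vector per text token: [text | visual | audio].
    """
    rng = rng or random.Random(0)
    fused = [list(t) for t in text]
    for name, stream in (("visual", visual), ("audio", audio)):
        stream = apply_modality_dropout(stream, p_drop, rng, training)
        if stream is None:
            stream = [missing_tokens[name]]
        attended = cross_attend(text, stream)
        fused = [f + a for f, a in zip(fused, attended)]  # list concatenation
    return fused
```

Because the text tokens always act as the queries, a missing audio or visual stream changes only what is attended to, not the shape of the fused output, so downstream layers see a fixed-size input either way.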
By using text as an anchor to filter noisy audio-visual signals, TAEMI achieves state-of-the-art emotional mimicry intensity estimation, demonstrating that a stable semantic prior can significantly improve multimodal fusion robustness.
Estimating Emotional Mimicry Intensity (EMI) in naturalistic environments is a critical yet challenging task in affective computing. The primary difficulty lies in effectively modeling the complex, nonlinear temporal dynamics across highly heterogeneous modalities, especially when physical signals are corrupted or missing. To tackle this, we propose TAEMI (Text-Anchored Emotional Mimicry Intensity estimation), a novel multimodal framework designed for the 10th ABAW Competition. Motivated by the observation that continuous visual and acoustic signals are highly susceptible to transient environmental noise, we break with the traditional symmetric fusion paradigm. Instead, we use textual transcripts, which inherently encode a stable, time-independent semantic prior, as central anchors. Specifically, we introduce a Text-Anchored Dual Cross-Attention mechanism that uses these robust textual queries to filter out frame-level redundancies and align the noisy physical streams. Furthermore, to prevent catastrophic performance degradation caused by the missing data that is inevitable in unconstrained real-world scenarios, we integrate Learnable Missing-Modality Tokens and a Modality Dropout strategy during training. Extensive experiments on the Hume-Vidmimic2 dataset demonstrate that TAEMI effectively captures fine-grained emotional variations and remains robust under imperfect conditions. Our framework achieves a state-of-the-art mean Pearson correlation coefficient across six continuous emotional dimensions, significantly outperforming existing baseline methods.
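The reported metric, a mean Pearson correlation coefficient over the emotional dimensions, can be sketched as follows. This is an illustrative computation, not the competition's official scoring code: it assumes the correlation is computed per dimension across all samples and then averaged with equal weight, and the function names `pearson` and `mean_pearson` are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def mean_pearson(predictions, targets):
    """Average the per-dimension Pearson correlation.

    predictions, targets: one row per sample, one column per
    emotional dimension (six columns in the setting described above).
    """
    num_dims = len(targets[0])
    per_dim = [
        pearson([row[d] for row in predictions], [row[d] for row in targets])
        for d in range(num_dims)
    ]
    return sum(per_dim) / num_dims
```

Since Pearson correlation is invariant to positive affine rescaling, this metric rewards predictions that track how intensity varies across samples rather than predictions that match its absolute scale.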