Microsoft Research | Ulsan National Institute of Science and Technology | Apr 9, 2026 | arXiv:2604.07786

Cross-Modal Emotion Transfer for Emotion Editing in Talking Face Video

Chanhyuk Choi, Taesoo Kim, Donggyu Lee, Siyeol Jung, Taehwan Kim

AI Summary

This paper introduces Cross-Modal Emotion Transfer (C-MET), a novel approach for emotion editing in talking face videos that leverages emotion semantic vectors learned between speech and visual feature spaces. C-MET uses a pretrained audio encoder and a disentangled facial expression encoder to model the difference between emotional embeddings across modalities. Experiments on the MEAD and CREMA-D datasets show a 14% improvement in emotion accuracy over state-of-the-art methods, with expressive results even for unseen extended emotions.

Key Contribution

C-MET produces more accurate and expressive talking face videos by transferring emotions from speech to facial expressions, even for nuanced emotions like sarcasm, and improves emotion accuracy by 14% over existing methods.

Abstract

Talking face generation has gained significant attention as a core application of generative models. To enhance the expressiveness and realism of synthesized videos, emotion editing in talking face video plays a crucial role. However, existing approaches often limit expressive flexibility and struggle to generate extended emotions. Label-based methods represent emotions with discrete categories, which fail to capture a wide range of emotions. Audio-based methods can leverage emotionally rich speech signals, and can even benefit from expressive text-to-speech (TTS) synthesis, but they fail to express the target emotions because emotion and linguistic content are entangled in emotional speech. Image-based methods, on the other hand, rely on target reference images to guide emotion transfer, yet they require high-quality frontal views and face challenges in acquiring reference data for extended emotions (e.g., sarcasm). To address these limitations, we propose Cross-Modal Emotion Transfer (C-MET), a novel approach that generates facial expressions from speech by modeling emotion semantic vectors between speech and visual feature spaces. C-MET leverages a large-scale pretrained audio encoder and a disentangled facial expression encoder to learn emotion semantic vectors that represent the difference between two emotional embeddings across modalities. Extensive experiments on the MEAD and CREMA-D datasets demonstrate that our method improves emotion accuracy by 14% over state-of-the-art methods while generating expressive talking face videos, even for unseen extended emotions. Code, checkpoint, and demo are available at https://chanhyeok-choi.github.io/C-MET/
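The idea of an emotion semantic vector as a difference between two emotional embeddings can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the tiny vector sizes, and the fixed linear map `audio_to_visual` (standing in for whatever learned cross-modal mapping the method uses) are all assumptions made for illustration only.

```python
from typing import List, Sequence

Vector = List[float]


def emotion_semantic_vector(source: Sequence[float], target: Sequence[float]) -> Vector:
    """Difference between two emotional embeddings in the same feature space."""
    return [t - s for s, t in zip(source, target)]


def project(matrix: Sequence[Sequence[float]], vec: Sequence[float]) -> Vector:
    """Apply a linear map (rows = output dims) to a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def transfer_emotion(
    neutral_expr: Sequence[float],
    audio_neutral: Sequence[float],
    audio_target: Sequence[float],
    audio_to_visual: Sequence[Sequence[float]],
) -> Vector:
    """Shift a visual expression embedding by a speech-derived emotion vector.

    `audio_to_visual` is a hypothetical placeholder for a learned
    cross-modal mapping; here it is just a fixed matrix.
    """
    delta_audio = emotion_semantic_vector(audio_neutral, audio_target)
    delta_visual = project(audio_to_visual, delta_audio)
    return [e + d for e, d in zip(neutral_expr, delta_visual)]


# Toy embeddings (made-up numbers, not real encoder outputs).
audio_neutral = [0.0, 1.0, 0.0]
audio_target = [1.0, 1.0, 2.0]
neutral_expr = [0.25, 0.5]
audio_to_visual = [
    [0.5, 0.0, 0.25],
    [0.0, 1.0, -0.5],
]

edited_expr = transfer_emotion(neutral_expr, audio_neutral, audio_target, audio_to_visual)
print(edited_expr)
```

In this sketch, passing the same speech embedding as both source and target yields a zero semantic vector, so the expression embedding is left unchanged; only the difference between the two emotional states drives the edit.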

Computer Vision, Multimodal Models, Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
