FPT Software AI Center · KAIST · Mar 15, 2026 · arXiv:2603.14267

DiFlowDubber: Discrete Flow Matching for Automated Video Dubbing via Cross-Modal Alignment and Synchronization

Ngoc-Son Nguyen, Thanh V. T. Tran, Jeongsoo Choi, Hieu-Nghia Huynh-Nguyen, Truong-Son Hy, Van Nguyen

AI Summary

DiFlowDubber, a novel two-stage framework, is introduced for automated video dubbing, leveraging discrete flow matching and knowledge transfer from pre-trained TTS models. The framework incorporates a FaPro module to capture prosody from facial expressions and a Synchronizer module to ensure precise speech-lip synchronization by bridging the modality gap between text, video, and speech. Experiments on benchmark datasets show DiFlowDubber surpasses existing methods in dubbing quality and synchronization accuracy.

Key Contribution

Achieves more natural and better-synchronized video dubbing by conditioning a discrete flow matching TTS model on facial expressions and cross-modal alignment.

Abstract

Video dubbing has broad applications in filmmaking, multimedia creation, and assistive speech technology. Existing approaches either train directly on limited dubbing datasets or adopt a two-stage pipeline that adapts pre-trained text-to-speech (TTS) models, which often struggle to produce expressive prosody, rich acoustic characteristics, and precise synchronization. To address these issues, we propose DiFlowDubber with a novel two-stage training framework that effectively transfers knowledge from a pre-trained TTS model to video-driven dubbing, with a discrete flow matching generative backbone. Specifically, we design a FaPro module that captures global prosody and stylistic cues from facial expressions and leverages this information to guide the modeling of subsequent speech attributes. To ensure precise speech-lip synchronization, we introduce a Synchronizer module that bridges the modality gap among text, video, and speech, thereby improving cross-modal alignment and generating speech that is temporally synchronized with lip movements. Experiments on two primary benchmark datasets demonstrate that DiFlowDubber outperforms previous methods across multiple metrics.
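The abstract names discrete flow matching as the generative backbone but does not describe its probability path or scheduler. As a rough illustration only, the sketch below shows the forward corruption step of one common construction, a masked mixture path with a linear schedule, where each speech token is kept with probability `t` and replaced by a mask token otherwise. The `MASK` id, the function names, and the linear schedule are all assumptions for illustration, not details taken from the paper.

```python
import random

MASK = -1  # hypothetical mask-token id; a real codec vocabulary reserves its own


def corrupt(tokens, t, rng):
    """Sample x_t on a masked mixture path (assumed linear schedule):
    each position keeps its clean token with probability t, else becomes MASK."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return [tok if rng.random() < t else MASK for tok in tokens]


def training_targets(tokens, noisy):
    """Return (position, clean_token) pairs a denoiser would be asked to predict."""
    return [(i, tok) for i, (tok, n) in enumerate(zip(tokens, noisy)) if n == MASK]


rng = random.Random(0)
clean = [12, 7, 7, 301, 45, 9]          # made-up discrete speech token ids
noisy = corrupt(clean, t=0.5, rng=rng)  # partially masked sequence
targets = training_targets(clean, noisy)
```

At `t = 0` every position is masked and at `t = 1` the clean sequence is returned unchanged. A model trained on such pairs would learn to fill in masked positions given conditioning signals, which in a dubbing setting could include text and video features.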

Multimodal Models · Natural Language Processing · Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
