Mar 4, 2026 | arXiv:2603.03882

UniSync: Towards Generalizable and High-Fidelity Lip Synchronization for Challenging Scenarios

Ruidi Fan, Siyuan Wang, Tian Yu, Yutong Jiang, Xusheng Liu

AI Summary

UniSync, a novel framework, tackles the challenges of high-fidelity lip synchronization across diverse scenarios by combining mask-free pose-anchored training with mask-based blending during inference. This approach mitigates color discrepancies and texture misalignment issues prevalent in existing methods. Fine-tuning on a diverse dataset enables UniSync to generalize effectively to complex real-world scenarios, as demonstrated by superior performance on the newly introduced RealWorld-LipSync benchmark.

Key Contribution

UniSync combines mask-free training for color consistency with mask-based inference for structural precision. The authors report that it outperforms state-of-the-art methods and moves lip synchronization toward generalizable, production-ready results.

Abstract

Lip synchronization aims to generate realistic talking videos that match given audio, which is essential for high-quality video dubbing. However, current methods have fundamental drawbacks: mask-based approaches suffer from local color discrepancies, while mask-free methods struggle with global background texture misalignment. Furthermore, most methods struggle with diverse real-world scenarios such as stylized avatars, face occlusion, and extreme lighting conditions. In this paper, we propose UniSync, a unified framework designed for achieving high-fidelity lip synchronization in diverse scenarios. Specifically, UniSync uses a mask-free pose-anchored training strategy to keep head motion and eliminate synthesis color artifacts, while employing mask-based blending consistent inference to ensure structural precision and smooth blending. Notably, fine-tuning on compact but diverse videos empowers our model with exceptional domain adaptability, handling complex corner cases effectively. We also introduce the RealWorld-LipSync benchmark to evaluate models under real-world demands, which covers diverse application scenarios including both human faces and stylized avatars. Extensive experiments demonstrate that UniSync significantly outperforms state-of-the-art methods, advancing the field towards truly generalizable and production-ready lip synchronization.
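To make the idea of mask-based blending at inference more concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation: the abstract does not specify how blending is done, so the feathered alpha blend, the box-blur softening, and the names `feather_mask` and `blend_with_mask` are all illustrative assumptions. It shows only the general technique of compositing a generated mouth region back into the original frame through a soft mask, so that pixels outside the mask keep the original colors.

```python
import numpy as np


def feather_mask(mask: np.ndarray, radius: int) -> np.ndarray:
    """Soften a binary (H, W) mask with a separable box blur.

    Hypothetical helper: the softening method is an assumption,
    not something stated in the paper's abstract.
    """
    soft = mask.astype(np.float64)
    if radius <= 0:
        return soft
    size = 2 * radius + 1
    kernel = np.ones(size) / size

    def blur_1d(values: np.ndarray) -> np.ndarray:
        # Edge padding keeps the output the same length as the input.
        padded = np.pad(values, radius, mode="edge")
        return np.convolve(padded, kernel, mode="valid")

    for axis in (0, 1):
        soft = np.apply_along_axis(blur_1d, axis, soft)
    return soft


def blend_with_mask(original: np.ndarray,
                    generated: np.ndarray,
                    mask: np.ndarray,
                    radius: int = 1) -> np.ndarray:
    """Alpha-blend `generated` into `original` inside a feathered mask.

    original, generated: float arrays of shape (H, W, C).
    mask: array of shape (H, W), 1 inside the region to replace.
    """
    alpha = feather_mask(mask, radius)[..., None]  # broadcast over channels
    return alpha * generated + (1.0 - alpha) * original


# Toy frames: a black original and a white "generated" frame.
original = np.zeros((9, 9, 3))
generated = np.ones((9, 9, 3))
mask = np.zeros((9, 9))
mask[3:6, 3:6] = 1  # stand-in for a mouth region

blended = blend_with_mask(original, generated, mask, radius=1)
```

In this toy case the center of the masked region takes the generated pixel, the far corners keep the original pixel, and the mask border holds intermediate values, which is what produces a smooth seam.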

Computer Vision | Multimodal Models | Speech & Audio

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
