Mar 2, 2026 | arXiv:2603.01418

UniTalking: A Unified Audio-Video Framework for Talking Portrait Generation

Hebeizi Li, Zihao Liang, Benyuan Sun, Zihao Yin, Xiao Sha, Chenliang Wang, Yi Yang

AI Summary

The paper introduces UniTalking, an open-source, end-to-end diffusion framework for high-fidelity talking portrait generation with personalized voice cloning. UniTalking uses Multi-Modal Transformer Blocks to explicitly model temporal correspondence between audio and video latent tokens via shared self-attention. Experiments demonstrate UniTalking achieves state-of-the-art performance compared to existing open-source methods in lip-sync accuracy, audio naturalness, and overall perceptual quality.

Key Contribution

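UniTalking is an open-source alternative to closed-source audio-video generators such as Veo3 and Sora2. It combines Multi-Modal Transformer Blocks with priors from a pre-trained video generation model, and it reports better lip-sync accuracy, audio naturalness, and perceptual quality than existing open-source methods.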

Abstract

While state-of-the-art audio-video generation models like Veo3 and Sora2 demonstrate remarkable capabilities, their closed-source nature makes their architectures and training paradigms inaccessible. To bridge this gap in accessibility and performance, we introduce UniTalking, a unified, end-to-end diffusion framework for generating high-fidelity speech and lip-synchronized video. At its core, our framework employs Multi-Modal Transformer Blocks to explicitly model the fine-grained temporal correspondence between audio and video latent tokens via a shared self-attention mechanism. By leveraging powerful priors from a pre-trained video generation model, our framework ensures state-of-the-art visual fidelity while enabling efficient training. Furthermore, UniTalking incorporates a personalized voice cloning capability, allowing the generation of speech in a target style from a brief audio reference. Qualitative and quantitative results demonstrate that our method produces highly realistic talking portraits, achieving superior performance over existing open-source approaches in lip-sync accuracy, audio naturalness, and overall perceptual quality.
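The shared self-attention idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes a single attention head, identity query/key/value projections, and toy 2-dimensional tokens, and the function name `shared_self_attention` is hypothetical. It only shows how concatenating the two token sequences lets every video token attend to every audio token, and vice versa, in one attention pass.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def shared_self_attention(video_tokens, audio_tokens):
    """Single-head attention over the concatenated video + audio sequence.

    Assumption: queries, keys, and values are the tokens themselves
    (identity projections). A real block would use learned projections.
    """
    tokens = video_tokens + audio_tokens  # one joint sequence
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs, weights = [], []
    for query in tokens:
        # Scaled dot-product score of this query against every token,
        # regardless of which modality the token came from.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in tokens]
        attn = softmax(scores)
        weights.append(attn)
        # Weighted sum of all tokens, computed per feature dimension.
        outputs.append(
            [sum(a * value[d] for a, value in zip(attn, tokens)) for d in range(dim)]
        )
    n_video = len(video_tokens)
    # Split the joint output back into per-modality streams.
    return outputs[:n_video], outputs[n_video:], weights


# Toy inputs: 3 video tokens and 2 audio tokens, each of dimension 2.
video = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.5], [1.0, -1.0]]
video_out, audio_out, weights = shared_self_attention(video, audio)
```

Each row of `weights` spans all five tokens, so the first three columns hold attention to video tokens and the last two hold attention to audio tokens. In this sketch, that mixing is the only place where the two modalities exchange information.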

Computer Vision, Multimodal Models, Speech & Audio

Citation Metrics

Citations: 0
Influential citations: 0
References: 42
Year: 2026
Venue: N/A
