The paper introduces VividFace, a one-step diffusion framework for video face enhancement (VFE) that addresses the computational inefficiency and limited generalization of existing methods. VividFace reformulates multi-step diffusion as a single-step flow matching process built on the WANX video generation model, significantly reducing inference time. The method incorporates a Joint Latent-Pixel Face-Focused Training strategy and a new high-quality face video dataset, MLLM-Face90, created via an MLLM-driven filtering pipeline.
Achieve state-of-the-art video face enhancement with VividFace, a one-step diffusion model that drastically cuts inference time while boosting perceptual quality and temporal consistency.
Video Face Enhancement (VFE) aims to restore high-quality facial regions from degraded video sequences, enabling a wide range of practical applications. Despite substantial progress in the field, current methods, which primarily rely on video super-resolution and generative frameworks, still face three fundamental challenges: (1) computational inefficiency caused by the iterative multi-step denoising of diffusion models; (2) the difficulty of faithfully modeling intricate facial textures while preserving temporal consistency; and (3) limited generalization due to the scarcity of high-quality face video training data. To address these challenges, we propose VividFace, a novel and efficient one-step diffusion framework for VFE. Built upon the pretrained WANX video generation model, VividFace reformulates the traditional multi-step diffusion process as a single-step flow matching paradigm that directly maps degraded inputs to high-quality outputs, substantially reducing inference time. To enhance facial detail recovery, we introduce a Joint Latent-Pixel Face-Focused Training strategy that constructs spatiotemporally aligned facial masks to guide optimization toward critical facial regions in both latent and pixel spaces. Furthermore, we develop an automated filtering pipeline driven by a multimodal large language model (MLLM) to produce MLLM-Face90, a meticulously curated high-quality face video dataset that ensures models learn from photorealistic facial textures. Extensive experiments demonstrate that VividFace achieves superior perceptual quality, identity preservation, and temporal consistency on both synthetic and real-world benchmarks. We will publicly release our code, models, and dataset to support future research.
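Two ideas from the abstract can be made concrete with a short sketch: the single-step flow matching mapping from a degraded input to an enhanced output, and a face-focused reconstruction loss applied jointly in latent and pixel space using spatiotemporally aligned masks. The code below is a minimal illustration under stated assumptions, not the authors' implementation; the generator interface, tensor shapes, mask handling, and the `face_weight` value are all hypothetical.

```python
# Minimal PyTorch sketch (assumptions, not the released VividFace code):
# - latents have shape (B, C, T', H', W'); decoded frames have shape (B, 3, T, H, W)
# - face_mask is a binary/soft mask of shape (B, 1, T, H, W), aligned to the frames
# - generator(z, t) is assumed to predict a flow-matching velocity field in latent space
import torch
import torch.nn.functional as F


def one_step_enhance(generator, degraded_latent):
    """Single-step flow matching: one velocity prediction, one Euler step."""
    t = torch.zeros(degraded_latent.shape[0], device=degraded_latent.device)
    velocity = generator(degraded_latent, t)   # assumed interface
    return degraded_latent + velocity          # degraded latent -> enhanced latent


def joint_face_focused_loss(pred_latent, gt_latent, pred_frames, gt_frames,
                            face_mask, face_weight=5.0):
    """L1 reconstruction loss in latent and pixel space, up-weighted on facial regions."""
    # Resample the pixel-space mask to the latent grid so the same spatiotemporal
    # facial region is emphasized in both spaces.
    latent_mask = F.interpolate(face_mask, size=pred_latent.shape[-3:], mode="nearest")

    w_latent = 1.0 + (face_weight - 1.0) * latent_mask   # faces weighted higher
    w_pixel = 1.0 + (face_weight - 1.0) * face_mask

    loss_latent = (w_latent * (pred_latent - gt_latent).abs()).mean()
    loss_pixel = (w_pixel * (pred_frames - gt_frames).abs()).mean()
    return loss_latent + loss_pixel
```

In this sketch the face mask is resampled with nearest-neighbor interpolation so the latent-space and pixel-space terms emphasize the same region; the paper's actual mask construction and weighting scheme may differ.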