Beihang | Mar 4, 2026 | arXiv:2603.04307

Dual Diffusion Models for Multi-modal Guided 3D Avatar Generation

Hong Li, Yutang Feng, Minqi Meng, Yichen Yang, Xuhui Liu, Baochang Zhang

AI Summary

PromptAvatar introduces a dual diffusion model framework for generating 3D avatars from text and/or image prompts, comprising a Texture Diffusion Model (TDM) and a Geometry Diffusion Model (GDM). A large-scale, multi-modal dataset of 100K+ samples was created to train these models, enabling direct mapping from prompts to 3D representations. The method achieves high-fidelity, shading-free avatar generation in under 10 seconds, outperforming existing SDS-based approaches in quality, detail, and speed.

Key Contribution

Generates high-fidelity 3D avatars in under 10 seconds, without iterative optimization, by directly mapping multi-modal prompts to 3D representations with dual diffusion models trained on a new large-scale dataset.

Abstract

Generating high-fidelity 3D avatars from text or image prompts is highly sought after in virtual reality and human-computer interaction. However, existing text-driven methods often rely on iterative Score Distillation Sampling (SDS) or CLIP optimization, which struggle with fine-grained semantic control and suffer from excessively slow inference. Meanwhile, image-driven approaches are severely bottlenecked by the scarcity and high acquisition cost of high-quality 3D facial scans, limiting model generalization. To address these challenges, we first construct a novel, large-scale dataset comprising over 100,000 pairs across four modalities: fine-grained textual descriptions, in-the-wild face images, high-quality light-normalized texture UV maps, and 3D geometric shapes. Leveraging this comprehensive dataset, we propose PromptAvatar, a framework featuring dual diffusion models. Specifically, it integrates a Texture Diffusion Model (TDM) that supports flexible multi-condition guidance from text and/or image prompts, alongside a Geometry Diffusion Model (GDM) guided by text prompts. By learning the direct mapping from multi-modal prompts to 3D representations, PromptAvatar eliminates the need for time-consuming iterative optimization, successfully generating high-fidelity, shading-free 3D avatars in under 10 seconds. Extensive quantitative and qualitative experiments demonstrate that our method significantly outperforms existing state-of-the-art approaches in generation quality, fine-grained detail alignment, and computational efficiency.

Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
