Saarland Informatics Campus · Apr 20, 2026 · arXiv:2604.18583

MUA: Mobile Ultra-detailed Animatable Avatars

AI Summary

This paper introduces a novel avatar representation called Wavelet-guided Multi-level Spatial Factorized Blendshapes, which combines high-fidelity appearance and dynamic geometry with low computational requirements. Using a distillation pipeline, the authors transfer motion-aware clothing dynamics and fine-grained appearance details from a pre-trained ultra-high-quality teacher avatar model to a compact representation, achieving up to 2000X lower computational cost and a 10X smaller model size than the teacher. Extensive comparisons show that the approach significantly outperforms existing mobile avatar methods and achieves comparable or superior rendering quality to most server-only approaches, while running in real time on resource-constrained devices such as the Meta Quest 3.

Key Contribution

A compact avatar representation that brings ultra-high-fidelity avatars to mobile devices at up to 2000X lower computational cost than the teacher model, while preserving dynamics and appearance details that closely resemble the teacher's.

Abstract

Building photorealistic, animatable full-body digital humans remains a longstanding challenge in computer graphics and vision. Recent advances in animatable avatar modeling have largely progressed along two directions: improving the fidelity of dynamic geometry and appearance, or reducing computational complexity to enable deployment on resource-constrained platforms, e.g., VR headsets. However, existing approaches fail to achieve both goals simultaneously: ultra-high-fidelity avatars typically require substantial computation on server-class GPUs, whereas lightweight avatars often suffer from limited surface dynamics, reduced appearance details, and noticeable artifacts. To bridge this gap, we propose a novel animatable avatar representation, termed Wavelet-guided Multi-level Spatial Factorized Blendshapes, and a corresponding distillation pipeline that transfers motion-aware clothing dynamics and fine-grained appearance details from a pre-trained ultra-high-quality avatar model into a compact, efficient representation. By coupling multi-level wavelet spectral decomposition with low-rank structural factorization in texture space, our method achieves up to 2000X lower computational cost and a 10X smaller model size than the original high-quality teacher avatar model, while preserving visually plausible dynamics and appearance details that closely resemble those of the teacher model. Extensive comparisons with state-of-the-art methods show that our approach significantly outperforms existing avatar approaches designed for mobile settings and achieves comparable or superior rendering quality to most approaches that can only run on servers. Importantly, our representation substantially improves the practicality of high-fidelity avatars for immersive applications, achieving over 180 FPS on a desktop PC and real-time native on-device performance at 24 FPS on a standalone Meta Quest 3.
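To make the idea of coupling a multi-level wavelet decomposition with a low-rank factorization more concrete, the following is a minimal sketch on a single-channel 2D map standing in for a texture-space signal. It is not the authors' implementation: the abstract does not specify the wavelet family, the factorization scheme, or how the blendshapes are driven, so the Haar wavelet, the truncated SVD, and all function names here (`haar_level`, `low_rank`, `compress`, and so on) are assumptions chosen purely for illustration.

```python
import numpy as np

def haar_level(x):
    """One level of an orthonormal 2D Haar transform (illustrative choice of wavelet)."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    low = (a + b + c + d) / 2.0   # coarse approximation
    lh = (a - b + c - d) / 2.0    # horizontal detail
    hl = (a + b - c - d) / 2.0    # vertical detail
    hh = (a - b - c + d) / 2.0    # diagonal detail
    return low, (lh, hl, hh)

def inverse_haar_level(low, bands):
    """Exact inverse of haar_level."""
    lh, hl, hh = bands
    h, w = low.shape
    x = np.empty((2 * h, 2 * w), dtype=low.dtype)
    x[0::2, 0::2] = (low + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (low - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (low + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (low - lh - hl + hh) / 2.0
    return x

def decompose(x, levels):
    """Multi-level decomposition: coarsest band plus per-level detail bands."""
    details = []
    low = x
    for _ in range(levels):
        low, bands = haar_level(low)
        details.append(bands)
    return low, details

def reconstruct(low, details):
    """Invert decompose by undoing levels from coarsest to finest."""
    for bands in reversed(details):
        low = inverse_haar_level(low, bands)
    return low

def low_rank(m, rank):
    """Best rank-`rank` approximation of a matrix via truncated SVD."""
    u, s, vt = np.linalg.svd(m, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

def compress(x, levels, rank):
    """Hypothetical pipeline: factorize every wavelet band, then reconstruct."""
    low, details = decompose(x, levels)
    low_c = low_rank(low, rank)
    details_c = [tuple(low_rank(b, rank) for b in bands) for bands in details]
    return reconstruct(low_c, details_c)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    texture = rng.standard_normal((16, 16))  # stand-in for a texture-space map
    approx = compress(texture, levels=2, rank=2)
    print("relative error:", np.linalg.norm(approx - texture) / np.linalg.norm(texture))
```

In this sketch each frequency band gets its own low-rank factors, so the storage per band scales with the rank rather than with the full band resolution. How the actual method allocates capacity across levels, and how the factors are predicted from motion, is not described in the abstract.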

Computer Vision Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
