The paper introduces LiftAvatar, a video diffusion Transformer that completes sparse kinematic inputs (facial expressions and head pose) to drive high-fidelity 3D Gaussian avatar animation. LiftAvatar uses a multi-granularity expression control scheme (shading maps and expression coefficients) and a multi-reference conditioning mechanism that aggregates cues from multiple frames, improving 3D consistency and controllability. Experiments show that LiftAvatar improves both the animation quality and the quantitative metrics of existing 3D avatar methods, particularly for unseen expressions, by enabling effective prior distillation from large-scale video generative models.
By completing sparse kinematic data with a video diffusion Transformer, LiftAvatar unlocks more expressive and robust 3D avatar animation from monocular video, even for extreme and unseen expressions.
We present LiftAvatar, a new paradigm that completes sparse monocular observations in kinematic space (e.g., facial expressions and head pose) and uses the completed signals to drive high-fidelity avatar animation. LiftAvatar is a fine-grained, expression-controllable large-scale video diffusion Transformer that synthesizes high-quality, temporally coherent expression sequences conditioned on single or multiple reference images. The key idea is to lift incomplete input data into a richer kinematic representation, thereby strengthening both reconstruction and animation in downstream 3D avatar pipelines. To this end, we introduce (i) a multi-granularity expression control scheme that combines shading maps with expression coefficients for precise and stable driving, and (ii) a multi-reference conditioning mechanism that aggregates complementary cues from multiple frames, enabling strong 3D consistency and controllability. As a plug-and-play enhancer, LiftAvatar directly addresses the limited expressiveness and reconstruction artifacts of 3D Gaussian Splatting-based avatars caused by sparse kinematic cues in everyday monocular videos. By expanding incomplete observations into diverse pose-expression variations, LiftAvatar also enables effective prior distillation from large-scale video generative models into 3D pipelines, leading to substantial gains. Extensive experiments show that LiftAvatar consistently boosts animation quality and quantitative metrics of state-of-the-art 3D avatar methods, especially under extreme, unseen expressions.
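The abstract names two conditioning ideas without giving their form: aggregating cues from multiple reference frames, and combining expression coefficients with shading maps. The sketch below is one illustrative way such conditioning could be assembled, not the authors' implementation. It assumes softmax attention pooling over per-frame feature vectors and plain concatenation of the control signals; the function names `aggregate_references` and `build_condition`, the feature shapes, and the use of the driving expression as the attention query are all hypothetical.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_references(query: List[float], refs: List[List[float]]) -> List[float]:
    """Pool reference-frame features with scaled dot-product attention.

    Assumption: each reference frame is already encoded as a feature vector
    with the same length as the query (e.g. the driving expression).
    """
    dim = len(query)
    scale = math.sqrt(dim)
    scores = [sum(q * r for q, r in zip(query, ref)) / scale for ref in refs]
    weights = softmax(scores)
    # Weighted average: references that match the query contribute more.
    return [sum(w * ref[i] for w, ref in zip(weights, refs)) for i in range(dim)]


def build_condition(
    expr_coeffs: List[float],
    shading_map: List[List[float]],
    pooled_refs: List[float],
) -> List[float]:
    """Concatenate coarse coefficients, a flattened dense shading map,
    and pooled reference features into one conditioning vector."""
    flat_shading = [value for row in shading_map for value in row]
    return list(expr_coeffs) + flat_shading + list(pooled_refs)


# Toy inputs (hypothetical values, chosen only to exercise the functions).
driving_expr = [1.0, 0.0]
reference_feats = [[1.0, 0.0], [0.0, 1.0]]
pooled = aggregate_references(driving_expr, reference_feats)
condition = build_condition(driving_expr, [[0.2, 0.4], [0.6, 0.8]], pooled)
```

In this toy setup the first reference matches the driving expression, so it receives the larger attention weight. A real system would operate on learned token embeddings inside the diffusion Transformer rather than on raw lists.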