Search papers, labs, and topics across Lattice.
HumANDiff introduces a novel framework for human video generation that enhances motion control by sampling articulated, motion-consistent noise on a 3D human body template. The method jointly learns appearance and motion by predicting both from the articulated noises, and enforces physical motion consistency through a geometric motion consistency loss in the articulated noise space. Experiments show state-of-the-art performance in generating motion-consistent, high-fidelity human videos with diverse clothing styles, without modifying the underlying diffusion model architecture.
Ditch the Gaussian noise: HumANDiff generates more realistic human videos by sampling noise directly on a 3D articulated human body, ensuring motion consistency.
Despite tremendous recent progress in human video generation, generative video diffusion models still struggle to capture the dynamics and physics of human motions faithfully. In this paper, we propose a new framework for human video generation, HumANDiff, which enhances the human motion control with three key designs: 1) Articulated motion-consistent noise sampling that correlates the spatiotemporal distribution of latent noise and replaces the unstructured random Gaussian noise with 3D articulated noise sampled on the dense surface manifold of a statistical human body template. It inherits body topology priors for spatially and temporally consistent noise sampling. 2) Joint appearance-motion learning that enhances the standard training objective of video diffusion models by jointly predicting pixel appearances and corresponding physical motions from the articulated noises. It enables high-fidelity human video synthesis, e.g., capturing motion-dependent clothing wrinkles. 3) Geometric motion consistency learning that enforces physical motion consistency across frames via a novel geometric motion consistency loss defined in the articulated noise space. HumANDiff enables scalable controllable human video generation by fine-tuning video diffusion models with articulated noise sampling. Consequently, our method is agnostic to diffusion model design, and requires no modifications to the model architecture. During inference, HumANDiff enables image-to-video generation within a single framework, achieving intrinsic motion control without requiring additional motion modules. Extensive experiments demonstrate that our method achieves state-of-the-art performance in rendering motion-consistent, high-fidelity humans with diverse clothing styles. Project page: https://taohuumd.github.io/projects/HumANDiff/