VideoMDM is a diffusion-based framework that learns to generate 3D human motion from 2D poses extracted from monocular videos, eliminating the need for 3D ground truth. Using a pretrained 2D-to-3D lifter as a noisy teacher, the model diffuses and denoises 3D poses while enforcing consistency through a depth-weighted 2D reprojection loss. The results show that VideoMDM significantly narrows the performance gap to fully 3D-supervised methods, achieving competitive fidelity on HumanML3D and on real-world video datasets.
VideoMDM nearly closes the gap to fully 3D-supervised motion generation, achieving near state-of-the-art results with only 2D supervision.
We introduce VideoMDM, a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences that serve as a noisy teacher: these are diffused, denoised by the model in 3D, and supervised in 2D by reprojecting the prediction and comparing it against accurate keypoints. We show that, under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision, and we adapt standard 3D motion regularizers (velocity consistency and over-parameterized representation alignment) to this 2D setting. Unlike methods that lift 2D to 3D only at inference, VideoMDM learns a coherent 3D motion manifold during training. On HumanML3D, it nearly closes the gap to fully 3D-supervised MDM (FID 0.88 vs 0.54); on the real-video datasets Fit3D and NBA, it learns to generate motions that are consistently preferred by humans, with strong quantitative results.
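To make the depth-weighting idea concrete, the following is a minimal sketch and not the paper's implementation. It assumes a pinhole camera with focal length `focal`, camera-space joints of shape `(frames, joints, 3)`, and a per-joint weight of `depth / focal`; the function names and the choice of where the depth comes from (here the true depth, in practice presumably the lifter's estimate) are all hypothetical. Under these assumptions, a purely lateral 3D error at a known depth is recovered exactly by the weighted 2D error, which is the intuition behind relating 2D reprojection supervision to 3D supervision.

```python
import numpy as np


def project(points_3d, focal=1.0):
    """Pinhole projection of (..., J, 3) camera-space points to (..., J, 2)."""
    xy = points_3d[..., :2]
    z = points_3d[..., 2:3]
    return focal * xy / z


def depth_weighted_reprojection_loss(pred_3d, keypoints_2d, depth, focal=1.0):
    """Squared 2D reprojection error, scaled per joint by depth / focal.

    Assumed form for illustration: the scaling undoes the 1/z shrinkage of
    perspective projection, so the residual is expressed in 3D units.
    """
    residual = project(pred_3d, focal) - keypoints_2d  # (..., J, 2)
    weight = depth[..., None] / focal                  # (..., J, 1)
    return np.mean(np.sum((weight * residual) ** 2, axis=-1))


rng = np.random.default_rng(0)
frames, joints, focal = 4, 17, 1.5

# Hypothetical ground-truth motion placed 2 to 5 units in front of the camera.
true_3d = rng.normal(size=(frames, joints, 3))
true_3d[..., 2] = rng.uniform(2.0, 5.0, size=(frames, joints))
keypoints_2d = project(true_3d, focal)

# A prediction that is wrong only laterally (x, y); its depth is correct.
offset = np.zeros_like(true_3d)
offset[..., :2] = 0.1 * rng.normal(size=(frames, joints, 2))
pred_3d = true_3d + offset

loss = depth_weighted_reprojection_loss(pred_3d, keypoints_2d, true_3d[..., 2], focal)
mse_3d = np.mean(np.sum(offset ** 2, axis=-1))
print(loss, mse_3d)
```

In this special case the two printed values agree up to floating-point error. With depth errors or noisy depth estimates the match is only approximate, which is where an in-expectation argument such as the one stated in the abstract would be needed.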