NVIDIA | Technion | Jun 11, 2026 | arXiv:2606.13364

VideoMDM: Towards 3D Human Motion Generation From 2D Supervision

Amir Mann, Gal Michael Harari, Merav Keidar, Or Litany

AI Summary

VideoMDM introduces a novel diffusion-based framework for generating 3D human motion from 2D pose data extracted from monocular videos, eliminating the need for 3D ground truth. By leveraging a pretrained 2D-to-3D lifter as a noisy teacher, the model learns to diffuse and denoise 3D poses while enforcing consistency through a depth-weighted 2D reprojection loss. The results show that VideoMDM significantly narrows the performance gap to fully 3D-supervised methods, achieving competitive fidelity on both synthetic and real-world datasets.

Key Contribution

VideoMDM nearly closes the gap to fully 3D-supervised motion generation using only 2D supervision (FID 0.88 vs 0.54 for 3D-supervised MDM on HumanML3D).

Abstract

We introduce VideoMDM, a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences that serve as a noisy teacher: these are diffused, denoised by the model in 3D, and supervised in 2D by reprojecting the prediction and comparing it against accurate keypoints. We show that, under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision, and we adapt standard 3D motion regularizers (velocity consistency and over-parameterized representation alignment) to this 2D setting. Unlike methods that lift 2D to 3D only at inference, VideoMDM learns a coherent 3D motion manifold during training. On HumanML3D, it nearly closes the gap to fully 3D-supervised MDM (FID 0.88 vs 0.54). On the real video datasets Fit3D and NBA, the method learns to generate motions that are consistently preferred by humans, with strong quantitative results.
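
To illustrate the idea of a depth-weighted reprojection loss, here is a minimal sketch, not the authors' implementation. It assumes a pinhole camera with a known focal length and a per-joint weight of (z / focal)^2; the abstract does not state the exact weighting or camera model, so both are assumptions, and the function names `project` and `depth_weighted_reprojection_loss` are hypothetical. The motivation for this particular weight: under perspective projection, a lateral 3D error of size d at depth z appears as a 2D error of about focal * d / z, so scaling the squared 2D error by (z / focal)^2 recovers d^2.

```python
import numpy as np


def project(points_3d, focal=1.0):
    """Pinhole projection of camera-space joints (..., J, 3) to 2D (..., J, 2).

    Assumes all depths (the z coordinate) are strictly positive.
    """
    z = points_3d[..., 2:3]
    return focal * points_3d[..., :2] / z


def depth_weighted_reprojection_loss(pred_3d, target_2d, focal=1.0):
    """Mean squared 2D reprojection error, weighted per joint by (z / focal)^2.

    The weight is an illustrative assumption: it undoes the 1/z shrinkage
    that perspective projection applies to lateral 3D errors.
    """
    z = pred_3d[..., 2]
    err_2d = project(pred_3d, focal) - target_2d
    sq_err = np.sum(err_2d ** 2, axis=-1)
    weights = (z / focal) ** 2
    return float(np.mean(weights * sq_err))


# Toy data: 4 frames, 17 joints, depths between 3 and 6 units.
rng = np.random.default_rng(0)
xy = rng.uniform(-1.0, 1.0, size=(4, 17, 2))
depth = rng.uniform(3.0, 6.0, size=(4, 17, 1))
gt_3d = np.concatenate([xy, depth], axis=-1)
target_2d = project(gt_3d)

# Shift every joint laterally by 0.1 while keeping its depth unchanged.
pred_3d = gt_3d.copy()
pred_3d[..., 0] += 0.1

loss_exact = depth_weighted_reprojection_loss(gt_3d, target_2d)
loss_shifted = depth_weighted_reprojection_loss(pred_3d, target_2d)
```

In this toy case, `loss_shifted` matches the squared 3D offset (0.1^2) regardless of each joint's depth, whereas an unweighted 2D loss would penalize distant joints less. Errors along the depth axis are not captured this way, which is presumably where the lifter-based teacher and the "in expectation" assumptions matter.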

Computer Vision, Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 55

Year: 2026

Venue: N/A
