Ant Group, ZJU | May 26, 2026 | arXiv:2605.26879

Natural Human Motion Recovery by Aligning High-Order Temporal Dynamics from Monocular Videos

Dingkun Wei, Zehong Shen, Yan Xia, Georgios Pavlakos, Xiaowei Zhou

AI Summary

HTD-Refine, a post-processing framework, enhances monocular Human Motion Recovery (HMR) by explicitly estimating and incorporating high-order temporal dynamics (velocity and acceleration). A temporal transformer, PVA-Net, predicts per-joint 2D positions, 3D velocities, and 3D accelerations from video, which are then used as constraints in a global optimization to refine trajectories. Experiments on in-the-wild benchmarks demonstrate that HTD-Refine improves existing HMR methods, yielding more accurate trajectories and natural motion.

Key Contribution

Explicitly modeling high-order temporal dynamics (velocity and acceleration) improves monocular human motion recovery: used as constraints when refining the output of existing HMR methods, the predicted dynamics yield more accurate global trajectories with less jitter and less over-smoothing.

Abstract

Human motion recovered from monocular videos often appears overly smooth or dynamically inconsistent, even when joint positions are numerically accurate. We observe that this limitation stems from the absence of reliable high-order temporal cues (velocity and acceleration), which are essential for reconstructing motion that exhibits realistic momentum, timing, and high-frequency detail. We introduce HTD-Refine, a post-processing framework that augments existing Human Motion Recovery (HMR) pipelines using explicitly estimated high-order temporal dynamics. At the core of our system is PVA-Net, a temporal transformer that infers per-joint 2D positions, 3D velocities, and 3D accelerations directly from a monocular video. These predicted dynamics serve as soft yet informative constraints in a global optimization procedure that refines world-space trajectories, significantly reducing jitter, suppressing over-smoothing, and restoring physically plausible motion. Extensive experiments on challenging in-the-wild benchmarks show that HTD-Refine consistently improves state-of-the-art HMR methods, yielding more accurate global trajectories and substantially more natural motion dynamics. Our results highlight the critical role of high-order temporal modeling in advancing monocular human motion recovery.
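
To make the idea of using velocities and accelerations as soft constraints concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single joint, an initial world-space trajectory from some HMR method, and per-frame velocity and acceleration predictions. The finite-difference operators, the quadratic objective, the weights `w_pos`, `w_vel`, `w_acc`, and the function names are all hypothetical choices made for illustration; the abstract does not specify the actual objective or solver.

```python
import numpy as np


def finite_difference_operators(num_frames, dt):
    """Build matrices that map positions to finite-difference dynamics.

    D1 (T-1 x T) gives forward-difference velocities,
    D2 (T-2 x T) gives central-difference accelerations.
    These discretizations are an assumption of this sketch.
    """
    T = num_frames
    D1 = np.zeros((T - 1, T))
    i = np.arange(T - 1)
    D1[i, i] = -1.0 / dt
    D1[i, i + 1] = 1.0 / dt

    D2 = np.zeros((T - 2, T))
    j = np.arange(T - 2)
    D2[j, j] = 1.0 / dt**2
    D2[j, j + 1] = -2.0 / dt**2
    D2[j, j + 2] = 1.0 / dt**2
    return D1, D2


def refine_trajectory(init_pos, pred_vel, pred_acc, dt,
                      w_pos=1.0, w_vel=1.0, w_acc=1.0):
    """Refine one joint's trajectory with soft dynamics constraints.

    Solves the linear least-squares problem
        min_x  w_pos * ||x - init_pos||^2
             + w_vel * ||D1 x - pred_vel||^2
             + w_acc * ||D2 x - pred_acc||^2

    init_pos: (T, 3) initial positions (e.g. from an existing HMR method)
    pred_vel: (T-1, 3) predicted velocities (hypothetical network output)
    pred_acc: (T-2, 3) predicted accelerations (hypothetical network output)
    """
    T = init_pos.shape[0]
    D1, D2 = finite_difference_operators(T, dt)

    # Stack the three weighted residual blocks into one system A x ~= b.
    A = np.vstack([
        np.sqrt(w_pos) * np.eye(T),
        np.sqrt(w_vel) * D1,
        np.sqrt(w_acc) * D2,
    ])
    b = np.vstack([
        np.sqrt(w_pos) * init_pos,
        np.sqrt(w_vel) * pred_vel,
        np.sqrt(w_acc) * pred_acc,
    ])

    # Each of the 3 coordinate columns is solved independently by lstsq.
    refined, *_ = np.linalg.lstsq(A, b, rcond=None)
    return refined
```

In this sketch the constraints are "soft" because the velocity and acceleration terms are penalties rather than equalities: the refined trajectory trades off staying close to the initial estimate against matching the predicted dynamics. A pure smoothness prior would instead penalize `D2 x` toward zero, which is one way over-smoothing arises; penalizing the difference from predicted accelerations lets genuine high-frequency motion survive.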

Computer Vision | Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
