The paper introduces MoT-HRA, a hierarchical vision-language-action framework for learning human-intention priors from a new 2.2M-episode dataset (HA-2.2M) of human manipulation videos. MoT-HRA factorizes manipulation into vision-language, intention (hand motion), and fine-grained action experts, using a shared-attention trunk and read-only key-value transfer to integrate human priors into robot control. Experiments demonstrate that MoT-HRA enhances motion plausibility and robustness in simulated and real-world robotic manipulation tasks, particularly under distribution shift.
By learning intention priors from 2.2 million episodes of human manipulation video, robots can produce more plausible motions and remain robust even when conditions shift away from the training distribution.
Human videos contain rich manipulation priors, but using them for robot learning remains difficult because raw observations entangle scene understanding, human motion, and embodiment-specific action. We introduce MoT-HRA, a hierarchical vision-language-action framework that learns human-intention priors from large-scale human demonstrations. We first curate HA-2.2M, a 2.2M-episode action-language dataset reconstructed from heterogeneous human videos through hand-centric filtering, spatial reconstruction, temporal segmentation, and language alignment. On top of this dataset, MoT-HRA factorizes manipulation into three coupled experts: a vision-language expert predicts an embodiment-agnostic 3D trajectory, an intention expert models MANO-style hand motion as a latent human-motion prior, and a fine-grained action expert maps the intention-aware representation to robot action chunks. A shared-attention trunk and read-only key-value transfer allow downstream control to use human priors while limiting interference with upstream representations. Experiments on hand motion generation, simulated manipulation, and real-world robot tasks show that MoT-HRA improves motion plausibility and control robustness under distribution shift.
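To make the read-only key-value transfer concrete, the sketch below shows one plausible way a downstream action expert could cross-attend to keys and values produced by an upstream intention expert while blocking gradients into them. This is not the authors' implementation; the module names, dimensions, chunk length, and the use of a single cross-attention layer are all illustrative assumptions.

```python
# Minimal sketch (not the paper's code) of read-only key-value transfer:
# the fine-grained action expert attends over upstream intention tokens,
# which are detached so downstream training cannot perturb the learned
# human-motion prior. Sizes and names are hypothetical.
import torch
import torch.nn as nn


class ReadOnlyKVTransfer(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 4,
                 action_dim: int = 7, chunk: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.action_head = nn.Linear(dim, action_dim * chunk)
        self.chunk, self.action_dim = chunk, action_dim

    def forward(self, robot_tokens: torch.Tensor,
                intention_tokens: torch.Tensor) -> torch.Tensor:
        # "Read-only": detach the upstream tokens so the action loss does not
        # back-propagate into the intention expert's representation.
        kv = intention_tokens.detach()
        fused, _ = self.cross_attn(query=robot_tokens, key=kv, value=kv)
        # Map the intention-aware representation to a chunk of robot actions.
        out = self.action_head(fused.mean(dim=1))
        return out.view(-1, self.chunk, self.action_dim)


if __name__ == "__main__":
    robot_tokens = torch.randn(2, 16, 256)      # robot-side observation tokens
    intention_tokens = torch.randn(2, 32, 256)  # upstream human-motion prior tokens
    actions = ReadOnlyKVTransfer()(robot_tokens, intention_tokens)
    print(actions.shape)  # torch.Size([2, 8, 7])
```

Under these assumptions, the key design choice is the `detach()` on the transferred keys and values: the downstream expert can still condition on the human prior, but its loss cannot rewrite the upstream representation, which matches the abstract's goal of limiting interference with upstream experts.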