Mar 9, 2026 | arXiv:2603.08572

MetaWorld-X: Hierarchical World Modeling via VLM-Orchestrated Experts for Humanoid Loco-Manipulation

Yutong Shen, Hangxu Liu, Penghui Liu, Jiashuo Luo, Yongkang Zhang, Rex Morvley, Chenfanfu Jiang, Chen Jiang, Jianwei Zhang, Lei Zhang

AI Summary

MetaWorld-X is a hierarchical world model framework for humanoid loco-manipulation. It decomposes complex control problems into Specialized Expert Policies (SEPs), each trained with imitation-constrained RL under human motion priors. An Intelligent Routing Mechanism (IRM), supervised by a Vision-Language Model (VLM), then dynamically composes these experts based on high-level task semantics. The approach is designed to yield more natural, stable, and generalizable whole-body control than monolithic policy learning.

Key Contribution

Humanoid robots can now perform complex loco-manipulation tasks with more natural and stable movements by decomposing control into VLM-orchestrated expert policies trained with human motion priors.

Abstract

Learning natural, stable, and compositionally generalizable whole-body control policies for humanoid robots performing simultaneous locomotion and manipulation (loco-manipulation) remains a fundamental challenge in robotics. Existing reinforcement learning approaches typically rely on a single monolithic policy to acquire multiple skills, which often leads to cross-skill gradient interference and motion pattern conflicts in high-degree-of-freedom systems. As a result, generated behaviors frequently exhibit unnatural movements, limited stability, and poor generalization to complex task compositions. To address these limitations, we propose MetaWorld-X, a hierarchical world model framework for humanoid control. Guided by a divide-and-conquer principle, our method decomposes complex control problems into a set of Specialized Expert Policies (SEPs). Each expert is trained under human motion priors through imitation-constrained reinforcement learning, introducing biomechanically consistent inductive biases that encourage natural and physically plausible motion generation. Building upon this foundation, we further develop an Intelligent Routing Mechanism (IRM) supervised by a Vision-Language Model (VLM), enabling semantic-driven expert composition. The VLM-guided router dynamically integrates expert policies according to high-level task semantics, facilitating compositional generalization and adaptive execution in multi-stage loco-manipulation tasks.
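
The abstract does not say how the router combines expert outputs. As a minimal sketch of one common option, the snippet below blends per-expert actions with softmax weights derived from router scores. Everything in it is an assumption made for illustration: the names `compose_experts`, `walk_expert`, and `reach_expert`, the two-dimensional action, and the soft-blending rule itself are hypothetical and not taken from the paper. The VLM supervision of the router is not modeled; the scores are simply passed in.

```python
import math
from typing import Callable, Dict, List, Sequence

Action = List[float]
Expert = Callable[[Sequence[float]], Action]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of router scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def compose_experts(
    observation: Sequence[float],
    experts: Dict[str, Expert],
    router_scores: Dict[str, float],
) -> Action:
    """Blend expert actions using softmax weights (assumed composition rule)."""
    names = list(experts)
    weights = softmax([router_scores[name] for name in names])
    actions = [experts[name](observation) for name in names]
    dim = len(actions[0])
    # Weighted sum of the expert actions, one action dimension at a time.
    return [sum(w * a[i] for w, a in zip(weights, actions)) for i in range(dim)]


# Hypothetical stand-in experts; trained SEPs would be learned policies.
def walk_expert(obs: Sequence[float]) -> Action:
    return [1.0, 0.0]


def reach_expert(obs: Sequence[float]) -> Action:
    return [0.0, 1.0]


experts = {"walk": walk_expert, "reach": reach_expert}

# Equal scores give equal weight to both experts.
blended = compose_experts([0.0], experts, {"walk": 0.0, "reach": 0.0})

# A much higher score for one expert makes the blend follow that expert.
mostly_walk = compose_experts([0.0], experts, {"walk": 10.0, "reach": 0.0})
```

Soft blending is only one possibility; a router could instead select a single expert per task stage, and the abstract leaves that choice open.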

Multimodal Models | Robotics & Embodied AI | World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 29

Year: 2026

Venue: N/A
