UIUC | Mar 3, 2026 | arXiv:2603.03279

ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation

Xialin He, Sirui Xu, Xinyao Li, Runpei Dong, Liuyu Bian, Yu-Xiong Wang, Liang-Yan Gui

AI Summary

The paper introduces ULTRA, a unified framework for autonomous humanoid whole-body loco-manipulation. It targets three limitations of existing methods: scarce or low-quality retargeted data, poor scaling to large skill repertoires, and reliance on tracking predefined motion references. ULTRA combines a physics-driven neural retargeting algorithm with a multimodal controller. The retargeting algorithm translates motion capture data to humanoid embodiments while preserving physical plausibility. The multimodal controller supports both dense references and sparse task specifications, and is trained by distilling a universal tracking policy, compressing motor skills into a compact latent space, and finetuning with reinforcement learning. ULTRA enables coordinated whole-body behavior from sparse intent without test-time reference motions, and is evaluated in simulation and on a real Unitree G1 humanoid.

Key Contribution

Humanoids can perform complex loco-manipulation tasks from egocentric vision and sparse goals, using a unified controller that does not rely on predefined motion references at test time.

Abstract

Achieving autonomous and versatile whole-body loco-manipulation remains a central barrier to making humanoids practically useful. Yet existing approaches are fundamentally constrained: retargeted data are often scarce or low-quality; methods struggle to scale to large skill repertoires; and, most importantly, they rely on tracking predefined motion references rather than generating behavior from perception and high-level task specifications. To address these limitations, we propose ULTRA, a unified framework with two key components. First, we introduce a physics-driven neural retargeting algorithm that translates large-scale motion capture to humanoid embodiments while preserving physical plausibility for contact-rich interactions. Second, we learn a unified multimodal controller that supports both dense references and sparse task specifications, under sensing ranging from accurate motion-capture state to noisy egocentric visual inputs. We distill a universal tracking policy into this controller, compress motor skills into a compact latent space, and apply reinforcement learning finetuning to expand coverage and improve robustness under out-of-distribution scenarios. This enables coordinated whole-body behavior from sparse intent without test-time reference motions. We evaluate ULTRA in simulation and on a real Unitree G1 humanoid. Results show that ULTRA generalizes to autonomous, goal-conditioned whole-body loco-manipulation from egocentric perception, consistently outperforming tracking-only baselines with limited skills.

Multimodal Models, Robotics & Embodied AI, World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 46

Year: 2026

Venue: N/A
