MotionVL uses VLMs to encode visual observations of a humanoid robot into natural language descriptions, which are then fed to an LLM that generates task-aligned reward functions for RL. This enables dynamic reward optimization and behavior alignment across diverse motor skills, bypassing the need for hand-crafted rewards. Experiments show that MotionVL achieves higher task success, greater robustness, better energy efficiency, and improved stability and human-likeness compared to handcrafted and LLM-only baselines, in simulated bipedal jumping, balancing, and running, and in real-world straight-leg walking.
Forget hand-crafted rewards: MotionVL uses VLMs and LLMs to automatically generate task-aligned reward functions for humanoid robot RL, leading to more human-like and robust motion.
Reward function design remains a fundamental challenge in reinforcement learning for humanoid robots, where handcrafted rewards often fail to capture human-like behavior, limit generalization, and require costly manual tuning. Addressing this bottleneck is essential for scalable, adaptive, interpretable, and robust humanoid motion control. In this work, we propose MotionVL, a novel framework for humanoid motion reinforcement learning that leverages multimodal large models to automate reward generation and semantic supervision. A Vision-Language Model (VLM) encodes visual observations into structured natural language descriptions, while a Large Language Model (LLM) generates task-aligned reward functions from these descriptions and the given instructions. This closed-loop design supports dynamic reward optimization and behavior alignment across diverse motor skills. We validate MotionVL in simulations of bipedal jumping, single-leg balancing, and running, and in real-world deployment of straight-leg walking. MotionVL achieves higher task success rates, improved robustness to external pushes, better energy efficiency, and superior stability and human-likeness over handcrafted and LLM-only baselines, establishing a scalable paradigm for language-informed humanoid motion learning.
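To make the closed-loop design concrete, here is a minimal sketch of one observe → describe → regenerate-reward iteration. The paper does not publish an API, so `vlm_describe`, `llm_generate_reward`, and all state fields below are hypothetical stand-ins, not MotionVL's actual interfaces.

```python
# Hypothetical sketch of MotionVL's VLM -> LLM reward-generation loop.
# All names and state fields are illustrative placeholders.

def vlm_describe(frame):
    """Stand-in VLM: map a visual observation to a structured text description.

    A real VLM would caption the robot's pose, contact state, gait phase, etc.
    """
    return {
        "pose": "upright",
        "left_foot_contact": frame["left_contact"],
        "right_foot_contact": frame["right_contact"],
    }

def llm_generate_reward(description, instruction):
    """Stand-in LLM: produce a task-aligned reward function from text.

    A real LLM would emit executable reward code conditioned on the
    description and instruction; here we hand-code one plausible output
    for the instruction "walk with straight legs".
    """
    def reward(state):
        upright_bonus = 1.0 if description["pose"] == "upright" else 0.0
        knee_penalty = 0.1 * abs(state["knee_angle"])  # discourage bent knees
        return upright_bonus - knee_penalty
    return reward

# One closed-loop iteration: the reward function is regenerated from the
# latest visual description, then handed to the RL trainer.
frame = {"left_contact": True, "right_contact": False}
reward_fn = llm_generate_reward(vlm_describe(frame), "walk with straight legs")
print(reward_fn({"knee_angle": 0.2}))
```

In the actual framework this loop would run inside RL training, with the LLM periodically refining the reward code as the VLM's descriptions of the robot's behavior evolve.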