MotionVL automates reward function design for humanoid robot RL by using a VLM to encode visual observations into natural language, which an LLM then uses to generate task-aligned rewards. This closed-loop system dynamically optimizes rewards and aligns behavior across diverse motor skills, leading to more human-like and robust motions. Experiments show MotionVL outperforms handcrafted and LLM-only baselines in simulated bipedal tasks and real-world straight-leg walking, demonstrating improved task success, robustness, energy efficiency, stability, and human-likeness.
Humanoid robots can now learn more human-like and robust movements without hand-engineered reward functions, thanks to a vision-language model that automatically translates observations into language-based rewards.
Reward function design remains a fundamental challenge in reinforcement learning for humanoid robots: handcrafted rewards often fail to capture human-like behavior, limit generalization, and require costly manual tuning. Addressing this bottleneck is essential for scalable, adaptive, interpretable, and robust humanoid motion control. We propose MotionVL, a novel framework for humanoid motion reinforcement learning that uses multimodal large models to automate reward generation and semantic supervision. A Vision-Language Model (VLM) encodes visual observations into structured natural language descriptions, and a Large Language Model (LLM) generates task-aligned reward functions from these descriptions and the given instructions. This closed-loop design supports dynamic reward optimization and behavior alignment across diverse motor skills. We validate MotionVL in simulations of bipedal jumping, single-leg balancing, and running, and in a real-world deployment of straight-leg walking. Compared with handcrafted and LLM-only baselines, MotionVL achieves higher task success rates, improved robustness to external pushes, better energy efficiency, and superior stability and human-likeness, establishing a scalable paradigm for language-informed humanoid motion learning.
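The closed loop described in the abstract (observe, describe in language, regenerate the reward) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the state fields `torso_height` and `torso_tilt`, the function names, the weight-doubling rule, and the 0.1 rad threshold are all hypothetical. The two model calls are replaced by plain functions so the example runs without any model, and the rollouts are supplied directly, whereas a real system would retrain the policy between iterations.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, float]
Weights = Dict[str, float]


def describe_rollout(states: List[State]) -> str:
    """Stand-in for the VLM (hypothetical): summarize a rollout as text.

    A real system would pass rendered frames to a vision-language model.
    """
    mean_height = sum(s["torso_height"] for s in states) / len(states)
    mean_tilt = sum(abs(s["torso_tilt"]) for s in states) / len(states)
    return f"mean torso height {mean_height:.2f} m, mean tilt {mean_tilt:.2f} rad"


def propose_weights(instruction: str, description: str, weights: Weights) -> Weights:
    """Stand-in for the LLM (hypothetical): revise reward weights from text.

    A real system would prompt a language model with the instruction and the
    description. Here a fixed rule doubles the tilt penalty when the reported
    tilt exceeds an assumed threshold of 0.1 rad; the instruction is unused.
    """
    tilt = float(description.split("mean tilt ")[1].split(" ")[0])
    revised = dict(weights)
    if tilt > 0.1:
        revised["tilt"] *= 2.0
    return revised


def make_reward(weights: Weights) -> Callable[[State], float]:
    """Build a reward function: reward torso height, penalize torso tilt."""
    def reward(state: State) -> float:
        return (weights["height"] * state["torso_height"]
                - weights["tilt"] * abs(state["torso_tilt"]))
    return reward


def refine_reward(instruction: str, rollouts: List[List[State]],
                  weights: Weights) -> Tuple[Callable[[State], float], Weights]:
    """One describe-then-revise pass per rollout (policy training omitted)."""
    for states in rollouts:
        description = describe_rollout(states)
        weights = propose_weights(instruction, description, weights)
    return make_reward(weights), weights


# Usage: the first rollout tilts heavily, the second is nearly upright.
rollouts = [
    [{"torso_height": 1.0, "torso_tilt": 0.3}],
    [{"torso_height": 1.0, "torso_tilt": 0.05}],
]
reward_fn, final_weights = refine_reward(
    "balance on one leg", rollouts, {"height": 1.0, "tilt": 1.0}
)
```

Keeping the language description as the only channel between the two stand-ins mirrors the division of labor in the abstract: the component that observes the robot never writes the reward, and the component that writes the reward never sees raw observations.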