MotionVL uses VLMs to encode visual observations of a humanoid robot into natural language descriptions, which are then fed to an LLM that generates task-aligned reward functions for RL. This enables dynamic reward optimization and behavior alignment across diverse motor skills, bypassing the need for hand-crafted rewards. Experiments show that MotionVL achieves higher task success, greater robustness, better energy efficiency, and improved stability and human-likeness compared to handcrafted and LLM-only baselines, in simulated bipedal jumping, balancing, and running, and in real-world straight-leg walking.
Forget hand-crafted rewards: MotionVL uses VLMs and LLMs to automatically generate task-aligned reward functions for humanoid robot RL, leading to more human-like and robust motion.
Reward function design remains a fundamental challenge in reinforcement learning for humanoid robots, where handcrafted rewards often fail to capture human-like behavior, limit generalization, and require costly manual tuning. Addressing this bottleneck is essential for scalable, adaptive, interpretable, and robust humanoid motion control. In this work, we propose MotionVL, a novel framework for humanoid motion reinforcement learning that leverages multimodal large models to automate reward generation and semantic supervision. A Vision-Language Model (VLM) encodes visual observations into structured natural language descriptions, while a Large Language Model (LLM) generates task-aligned reward functions from these descriptions and the given instructions. This closed-loop design supports dynamic reward optimization and behavior alignment across diverse motor skills. We validate MotionVL in simulations of bipedal jumping, single-leg balancing, and running, and in real-world deployment of straight-leg walking. MotionVL achieves higher task success rates, improved robustness to external pushes, better energy efficiency, and superior stability and human-likeness over handcrafted and LLM-only baselines, establishing a scalable paradigm for language-informed humanoid motion learning.
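To make the closed-loop design concrete, here is a minimal sketch of one observe → describe → regenerate-reward iteration. The paper does not publish an API, so `vlm_describe`, `llm_generate_reward`, and all state fields below are hypothetical stand-ins, not MotionVL's actual interfaces.

```python
# Hypothetical sketch of MotionVL's VLM -> LLM reward-generation loop.
# All names and state fields are illustrative placeholders.

def vlm_describe(frame):
    """Stand-in VLM: map a visual observation to a structured text description.

    A real VLM would caption the robot's pose, contact state, gait phase, etc.
    """
    return {
        "pose": "upright",
        "left_foot_contact": frame["left_contact"],
        "right_foot_contact": frame["right_contact"],
    }

def llm_generate_reward(description, instruction):
    """Stand-in LLM: produce a task-aligned reward function from text.

    A real LLM would emit executable reward code conditioned on the
    description and instruction; here we hand-code one plausible output
    for the instruction "walk with straight legs".
    """
    def reward(state):
        upright_bonus = 1.0 if description["pose"] == "upright" else 0.0
        knee_penalty = 0.1 * abs(state["knee_angle"])  # discourage bent knees
        return upright_bonus - knee_penalty
    return reward

# One closed-loop iteration: the reward function is regenerated from the
# latest visual description, then handed to the RL trainer.
frame = {"left_contact": True, "right_contact": False}
reward_fn = llm_generate_reward(vlm_describe(frame), "walk with straight legs")
print(reward_fn({"knee_angle": 0.2}))
```

In the actual framework this loop would run inside RL training, with the LLM periodically refining the reward code as the VLM's descriptions of the robot's behavior evolve.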