The paper introduces Robot-R1, a reinforcement learning framework designed to improve embodied reasoning in robotics by training models to predict the next keypoint state for task completion, conditioned on scene images and environment metadata. Robot-R1 addresses the limitations of supervised fine-tuning (SFT), such as heuristically constructed datasets and catastrophic forgetting, by sampling reasoning-based responses and reinforcing those leading to more accurate predictions. Experimental results on a newly introduced benchmark demonstrate that Robot-R1 outperforms SFT methods and even surpasses GPT-4o in reasoning tasks related to low-level action control, despite having significantly fewer parameters.
A 7B model trained with reinforcement learning beats GPT-4o on robotic reasoning tasks involving low-level action control.
Large Vision-Language Models (LVLMs) have recently shown great promise in advancing robotics by combining embodied reasoning with robot control. A common approach is to train on embodied reasoning tasks related to robot control using Supervised Fine-Tuning (SFT). However, SFT datasets are often heuristically constructed and not explicitly optimized for improving robot control. Furthermore, SFT often leads to issues such as catastrophic forgetting and reduced generalization performance. To address these limitations, we introduce Robot-R1, a novel framework that leverages reinforcement learning to enhance embodied reasoning specifically for robot control. Robot-R1 learns to predict the next keypoint state required for task completion, conditioned on the current scene image and environment metadata derived from expert demonstrations. Inspired by the DeepSeek-R1 learning approach, Robot-R1 samples reasoning-based responses and reinforces those that lead to more accurate predictions. To rigorously evaluate Robot-R1, we also introduce a new benchmark that demands diverse embodied reasoning capabilities for task completion. Our experiments show that models trained with Robot-R1 outperform SFT methods on embodied reasoning tasks. Despite having only 7B parameters, Robot-R1 even surpasses GPT-4o on reasoning tasks related to low-level action control, such as spatial and movement reasoning.
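The sample-and-reinforce loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the reward function, tolerance threshold, and group-normalized advantage (in the style of DeepSeek-R1's GRPO) are assumptions made for clarity, and the keypoint representation is simplified to a coordinate tuple.

```python
def keypoint_reward(predicted, target, tol=0.05):
    # Hypothetical reward: 1.0 if every coordinate of the predicted
    # next-keypoint state is within `tol` of the expert demonstration's
    # keypoint, else 0.0. The paper's actual reward may differ.
    err = max(abs(p - t) for p, t in zip(predicted, target))
    return 1.0 if err <= tol else 0.0

def group_advantages(rewards):
    # GRPO-style normalization (as in DeepSeek-R1): each sampled
    # reasoning response's reward is compared to the group mean, so
    # responses that lead to more accurate predictions get positive
    # advantage and are reinforced.
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5 or 1.0  # avoid division by zero for uniform groups
    return [(r - mean) / std for r in rewards]

# Toy usage: four sampled reasoning responses, each ending in a
# predicted keypoint; the target comes from an expert demonstration.
target = (0.10, 0.20, 0.30)
predictions = [(0.11, 0.20, 0.30), (0.40, 0.20, 0.30),
               (0.10, 0.22, 0.31), (0.10, 0.50, 0.30)]
rewards = [keypoint_reward(p, target) for p in predictions]
advantages = group_advantages(rewards)
```

In a full training loop these advantages would weight the policy-gradient update on the sampled responses; the sketch only shows how prediction accuracy is turned into a reinforcement signal.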