Mar 30, 2026 · arXiv:2603.28730

SOLE-R1: Video-Language Reasoning as the Sole Reward for On-Robot Reinforcement Learning

Philip Schroeder, Thomas Weng, Karl Schmeckpeper, Eric Rosen, Stephen Hart, Ondrej Biza

AI Summary

This paper introduces SOLE-R1, a video-language reasoning model designed to provide dense reward signals for online robot reinforcement learning using only raw video observations and a natural-language goal. SOLE-R1 performs per-timestep spatiotemporal chain-of-thought reasoning to estimate task progress. It is trained on data from a video trajectory and reasoning synthesis pipeline, using a hybrid framework that combines supervised fine-tuning with RL from verifiable rewards. Experiments across four simulation environments and a real-robot setting show that SOLE-R1 enables zero-shot online RL, outperforming vision-language rewarders such as GPT-5 and Gemini-3-Pro and exhibiting greater robustness to reward hacking.

Key Contribution

With SOLE-R1 as the sole reward signal, robots learn previously unseen manipulation tasks from random initialization using only raw video and a natural-language goal, without ground-truth rewards, success indicators, demonstrations, or task-specific tuning.

Abstract

Vision-language models (VLMs) have shown impressive capabilities across diverse tasks, motivating efforts to leverage these models to supervise robot learning. However, when used as evaluators in reinforcement learning (RL), today's strongest models often fail under partial observability and distribution shift, enabling policies to exploit perceptual errors rather than solve the task. To address this limitation, we introduce SOLE-R1 (Self-Observing LEarner), a video-language reasoning model explicitly designed to serve as the sole reward signal for online RL. Given only raw video observations and a natural-language goal, SOLE-R1 performs per-timestep spatiotemporal chain-of-thought (CoT) reasoning and produces dense estimates of task progress that can be used directly as rewards. To train SOLE-R1, we develop a large-scale video trajectory and reasoning synthesis pipeline that generates temporally grounded CoT traces aligned with continuous progress supervision. This data is combined with foundational spatial and multi-frame temporal reasoning, and used to train the model with a hybrid framework that couples supervised fine-tuning with RL from verifiable rewards. Across four different simulation environments and a real-robot setting, SOLE-R1 enables zero-shot online RL from random initialization: robots learn previously unseen manipulation tasks without ground-truth rewards, success indicators, demonstrations, or task-specific tuning. SOLE-R1 succeeds on 24 unseen tasks and substantially outperforms strong vision-language rewarders, including GPT-5 and Gemini-3-Pro, while exhibiting markedly greater robustness to reward hacking.
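The idea of using per-timestep progress estimates directly as rewards can be illustrated with a minimal sketch. This is not the authors' implementation: the `ProgressEstimator` interface, the `dense_rewards` helper, the clamping to [0, 1], and the `toy_estimator` stand-in are all assumptions made for illustration, and the real model would replace the toy function.

```python
from typing import Callable, List, Sequence

# Hypothetical interface (an assumption, not from the paper): maps the frames
# observed so far plus a natural-language goal to a progress estimate.
ProgressEstimator = Callable[[Sequence[object], str], float]


def dense_rewards(
    frames: Sequence[object],
    goal: str,
    estimate_progress: ProgressEstimator,
) -> List[float]:
    """Return one reward per timestep, using the progress estimate as the reward."""
    rewards: List[float] = []
    for t in range(1, len(frames) + 1):
        # The estimator only sees the video prefix up to the current timestep.
        progress = estimate_progress(frames[:t], goal)
        # Clamp to [0, 1]; this bounded range is an assumption of the sketch.
        rewards.append(min(1.0, max(0.0, progress)))
    return rewards


def toy_estimator(frames: Sequence[object], goal: str) -> float:
    """Stand-in for a video-language model: progress grows with frames seen."""
    return len(frames) / 4.0


if __name__ == "__main__":
    print(dense_rewards(list(range(6)), "put the block in the bowl", toy_estimator))
```

In an online RL loop, each entry of the returned list would take the place of the environment reward for that timestep, so no ground-truth reward or success indicator is consulted.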

Multimodal Models · RLHF & Preference Learning · Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
