Mar 16, 2026 · arXiv:2603.15600

From Passive Observer to Active Critic: Reinforcement Learning Elicits Process Reasoning for Robotic Manipulation

Yibin Liu, Yaxing Lyu, Daqi Gao, Zhixuan Liang, Weiliang Tang, Shilong Mu, Xiaokang Yang, Yao Mu

AI Summary

This paper introduces PRIMO R1, a 7B video MLLM framework designed to improve process supervision in long-horizon robotic manipulation by transforming passive observers into active critics. The framework uses outcome-based Reinforcement Learning to incentivize explicit Chain-of-Thought generation for progress estimation and anchors video sequences between initial and current state images. Experiments on the PRIMO Dataset and Benchmark, along with the RoboFail benchmark, demonstrate that PRIMO R1 achieves state-of-the-art performance, including a 50% reduction in the mean absolute error compared to specialized reasoning baselines and surpassing OpenAI o1 on failure detection.

Key Contribution

A 7B model, trained with reinforcement learning to reason about robotic manipulation, outperforms 72B-scale general MLLMs and even OpenAI's closed-source models in process supervision and failure detection.

Abstract

Accurate process supervision remains a critical challenge for long-horizon robotic manipulation. A primary bottleneck is that current video MLLMs, trained primarily under a Supervised Fine-Tuning (SFT) paradigm, function as passive "Observers" that recognize ongoing events rather than evaluating the current state relative to the final task goal. In this paper, we introduce PRIMO R1 (Process Reasoning Induced Monitoring), a 7B framework that transforms video MLLMs into active "Critics". We leverage outcome-based Reinforcement Learning to incentivize explicit Chain-of-Thought generation for progress estimation. Furthermore, our architecture constructs a structured temporal input by explicitly anchoring the video sequence between initial and current state images. Supported by the proposed PRIMO Dataset and Benchmark, extensive experiments across diverse in-domain environments and out-of-domain real-world humanoid scenarios demonstrate that PRIMO R1 achieves state-of-the-art performance. Quantitatively, our 7B model reduces the mean absolute error of specialized reasoning baselines by 50% and delivers significant relative accuracy improvements over 72B-scale general MLLMs. Furthermore, PRIMO R1 exhibits strong zero-shot generalization on difficult failure detection tasks. We establish state-of-the-art performance on the RoboFail benchmark with 67.0% accuracy, surpassing closed-source models like OpenAI o1 by 6.0%.
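To make the headline metric concrete, the following is a minimal sketch of how a mean absolute error over progress estimates, and a relative reduction between two models, could be computed. It assumes progress is expressed as a percentage of task completion, which the abstract does not specify. The function names and all numbers are hypothetical and chosen only for illustration; they are not results from the paper.

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Average absolute gap between predicted and true progress values."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def relative_reduction(baseline_error: float, model_error: float) -> float:
    """Fraction by which model_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return (baseline_error - model_error) / baseline_error


# Hypothetical task-progress values in percent (assumption: a 0-100 scale).
ground_truth = [10.0, 40.0, 70.0, 100.0]
baseline_pred = [30.0, 20.0, 90.0, 80.0]  # made-up baseline estimates
model_pred = [20.0, 30.0, 80.0, 90.0]     # made-up improved estimates

baseline_mae = mean_absolute_error(baseline_pred, ground_truth)
model_mae = mean_absolute_error(model_pred, ground_truth)
reduction = relative_reduction(baseline_mae, model_mae)

print(f"baseline MAE: {baseline_mae}, model MAE: {model_mae}, reduction: {reduction:.0%}")
```

In this toy case the second set of estimates halves the average error, which is the kind of relationship a "50% reduction in mean absolute error" describes.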

Multimodal Models · Reasoning & Chain-of-Thought · Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
