Stanford HAI · SJTU · Jun 9, 2026 · arXiv:2606.10305

SARM2: Multi-Task Stage Aware Reward Modeling for Self Improving Robotic Manipulation

Qianzhong Chen, Hau Zheng, Justin Yu, Suning Huang, Jiankai Sun, Ken Goldberg, Chuan Wen, Pieter Abbeel, Yide Shentu, Philipp Wu, Mac Schwager

AI Summary

This paper introduces SARM2, a multi-task stage-aware reward model for robotic manipulation that combines an action-primitive-based stage estimator with a multi-gate Mixture-of-Experts (MMoE) value head to generate dense per-step rewards. On a 10-task benchmark, SARM2 reduces value-estimation mean squared error (MSE) by 80% relative to the strongest baselines, which are either task-specific stage-aware models that require per-task annotations or general vision-language-model reward models that are too coarse for fine-grained long-horizon progress. The accompanying SPIRAL framework uses SARM2 rewards to improve vision-language-action policies from cheap autonomous rollouts, raising task success from 58% to 100% on Folding Shorts and from 50% to 90% on Cleaning Whiteboard.

Key Contribution

High-quality dense rewards can raise success rates on real-world manipulation tasks from around 50% to near-perfect levels: 58% to 100% on Folding Shorts and 50% to 90% on Cleaning Whiteboard.

Abstract

Fine-tuning vision-language-action (VLA) policies for long-horizon manipulation still relies heavily on behavior cloning, which requires costly high-quality demonstrations and keeps policies near the demonstration distribution. Reward models can reduce this dependence by reweighting demonstrations and providing dense supervision for on-robot reinforcement learning (RL), but they must be dense, accurate, and general. Existing methods fall short: task-specific stage-aware models are accurate but require per-task annotations, while general vision-language-model (VLM) reward models are broadly applicable but too coarse for fine-grained long-horizon progress. We introduce SARM2, a multi-task stage-aware reward model that combines an action-primitive-based stage estimator with a multi-gate Mixture-of-Experts (MMoE) value head to produce dense per-step rewards across manipulation tasks. Building on SARM2, we further propose SPIRAL (Self-Policy Improvement via Reward-Aligned Learning), an on-policy reward-guided framework that improves VLA policies from cheap autonomous rollouts. On a 10-task benchmark, SARM2 reduces value-estimation MSE by 80% over the strongest baselines; when used in SPIRAL, it improves task success from around 50% to near-perfect performance on Folding Shorts (58% to 100%) and Cleaning Whiteboard (50% to 90%), showing that high-quality dense rewards are key to a stable robot data flywheel. Project website: https://qianzhong-chen.github.io/sarm2.github.io/.
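To make the "multi-gate Mixture-of-Experts value head" concrete, the following is a minimal NumPy sketch of a generic MMoE forward pass: shared experts, one softmax gate per task, and one small output tower per task. It is not the authors' implementation. The class name `MMoEValueHead`, the layer sizes, the random weights, and the sigmoid that squashes the output into a (0, 1) progress-like value are all illustrative assumptions, and the stage estimator that would supply the input features is omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


class MMoEValueHead:
    """Hypothetical multi-gate MoE head: shared experts, per-task gates and towers."""

    def __init__(self, feat_dim, num_experts, num_tasks, hidden_dim=16, seed=0):
        rng = np.random.default_rng(seed)
        # One linear expert per slot, shared by every task (assumed sizes).
        self.expert_w = rng.normal(0.0, 0.1, (num_experts, feat_dim, hidden_dim))
        # One gating matrix per task: this is the "multi-gate" part.
        self.gate_w = rng.normal(0.0, 0.1, (num_tasks, feat_dim, num_experts))
        # One scalar-output tower per task.
        self.tower_w = rng.normal(0.0, 0.1, (num_tasks, hidden_dim))

    def gate_weights(self, feats, task_id):
        """Per-step mixture weights over experts for one task, shape (T, E)."""
        return softmax(feats @ self.gate_w[task_id], axis=-1)

    def __call__(self, feats, task_id):
        """Map per-step features (T, feat_dim) to per-step values (T,)."""
        # Every expert processes every step: (T, E, H), with a ReLU nonlinearity.
        expert_out = np.maximum(0.0, np.einsum("tf,efh->teh", feats, self.expert_w))
        gates = self.gate_weights(feats, task_id)            # (T, E)
        mixed = np.einsum("te,teh->th", gates, expert_out)   # (T, H)
        logits = mixed @ self.tower_w[task_id]               # (T,)
        # Assumption: squash to (0, 1) so the output reads as task progress.
        return 1.0 / (1.0 + np.exp(-logits))


# Usage: 5 timesteps of 8-dimensional features, scored for task 3 of 10.
head = MMoEValueHead(feat_dim=8, num_experts=4, num_tasks=10)
feats = np.random.default_rng(1).normal(size=(5, 8))
values = head(feats, task_id=3)
print(values.shape)
```

The per-task gates let tasks share experts where their dynamics overlap while weighting them differently where they do not, which is the usual motivation for MMoE in multi-task settings.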

RLHF & Preference Learning · Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
