CASK Challenge

Mitsubishi Electric Research Laboratories (MERL)

Jun 8, 2026

arXiv:2606.09630

ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies

Haodi Hu, Chung-Ta Huang, Jing Liu, Ye Wang, Kei Suzuki, Matthew Brand, Toshiaki Koike-Akino

AI Summary

This paper introduces ReCoVLA, a framework that improves failure recovery for vision-language-action (VLA) policies. An external vision-language model (VLM) identifies the failure mode and recovery stage, while the pretrained VLA policy stays frozen. The VLM selects which task-relevant reward components are combined into a structured reward, and that reward is used to train residual recovery policies in simulation. This decouples high-level failure understanding from low-level corrective control. In simulation, average success rises from 36.7% for the fine-tuned $\pi_{0.5}$ baseline to 66.7%; in physical zero-shot sim-to-real experiments, ReCoVLA reaches 61.7% average success.

Key Contribution

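A VLM-guided reward compilation approach that raises average simulated success by 30 percentage points over the fine-tuned $\pi_{0.5}$ baseline (36.7% to 66.7%) and reaches 61.7% average success in zero-shot sim-to-real manipulation.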

Abstract

Vision-language-action (VLA) policies provide strong priors for language-conditioned manipulation, but remain brittle in off-nominal states requiring targeted recovery. We propose ReCoVLA -- a failure-conditioned residual recovery framework that keeps a pretrained VLA policy frozen, uses an external vision-language model (VLM) to infer the failure mode and recovery stage, and compiles a structured reward from task-relevant components. Rather than using the VLM to generate actions or rewards directly, ReCoVLA uses it as a semantic reward selector: it predicts a recovery descriptor and reward mask for in-simulation residual-policy training, followed by zero-shot sim-to-real deployment of the trained recovery policies. This decouples high-level failure understanding from low-level corrective control to support different VLAs. Experiments across short-horizon, long-horizon, and contact-rich manipulation tasks show that ReCoVLA outperforms the tested baselines on average. In simulation, our reward compiler improves average success from 36.7% for the fine-tuned $\pi_{0.5}$ baseline to 66.7%. In physical zero-shot sim-to-real experiments, ReCoVLA achieves the best average performance, with 61.7% success.
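The abstract's idea of a VLM acting as a semantic reward selector can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the component names (`reach`, `grasp`, `place`), the state keys, the `RecoveryDescriptor` fields, the unweighted sum over selected components, and the additive form of the residual action are not taken from the paper, which does not specify them in this summary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

State = Dict[str, float]

# Hypothetical library of task-relevant reward components (names are assumptions).
REWARD_COMPONENTS: Dict[str, Callable[[State], float]] = {
    "reach": lambda s: -s["gripper_to_object_dist"],
    "grasp": lambda s: 1.0 if s["object_grasped"] > 0.5 else 0.0,
    "place": lambda s: -s["object_to_goal_dist"],
}


@dataclass(frozen=True)
class RecoveryDescriptor:
    """What the VLM is assumed to predict: a failure label, a stage, and a mask."""
    failure_mode: str
    recovery_stage: str
    reward_mask: Dict[str, bool]


def compile_reward(descriptor: RecoveryDescriptor) -> Callable[[State], float]:
    """Build a reward function that sums only the components the mask selects."""
    active = [name for name, on in descriptor.reward_mask.items() if on]
    unknown = [name for name in active if name not in REWARD_COMPONENTS]
    if unknown:
        raise KeyError(f"mask selects unknown reward components: {unknown}")

    def reward(state: State) -> float:
        return sum(REWARD_COMPONENTS[name](state) for name in active)

    return reward


def residual_action(base: Sequence[float], residual: Sequence[float],
                    scale: float = 1.0) -> List[float]:
    """Assumed additive composition: frozen VLA action plus a scaled correction."""
    if len(base) != len(residual):
        raise ValueError("base and residual actions must have the same length")
    return [b + scale * r for b, r in zip(base, residual)]


# Example: the object was dropped, so the mask enables reaching and grasping only.
descriptor = RecoveryDescriptor(
    failure_mode="object_dropped",
    recovery_stage="regrasp",
    reward_mask={"reach": True, "grasp": True, "place": False},
)
reward_fn = compile_reward(descriptor)
state = {"gripper_to_object_dist": 0.2, "object_grasped": 1.0,
         "object_to_goal_dist": 0.5}
example_reward = reward_fn(state)
example_action = residual_action([0.1, 0.2], [0.05, -0.1], scale=0.5)
```

In this sketch the VLM never emits a numeric reward or an action. It only chooses which predefined components are active, which is one way to read the separation between failure understanding and corrective control described above.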

Multimodal Models, RLHF & Preference Learning, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
