CUHK, MBZUAI, USTC | Mar 10, 2026 | arXiv:2603.09292

See, Plan, Rewind: Progress-Aware Vision-Language-Action Models for Robust Robotic Manipulation

AI Summary

The paper introduces See, Plan, Rewind (SPR), a vision-language-action framework for robotic manipulation that explicitly tracks task progress through spatial subgoals derived from language instructions. SPR operates in a closed loop, continuously assessing the current state, planning trajectories toward waypoints, and rewinding to recoverable states when progress stalls. Experiments on the LIBERO and LIBERO-Plus benchmarks demonstrate that SPR outperforms baselines and achieves state-of-the-art robustness, particularly in out-of-distribution scenarios with unseen instructions and initial states.

Key Contribution

Robots can now recover from failures during manipulation tasks by explicitly tracking progress against spatial subgoals, without needing extra training data or models.

Abstract

Measurement of task progress through explicit, actionable milestones is critical for robust robotic manipulation. This progress awareness enables a model to ground its current task status, anticipate verifiable intermediate states, and detect and recover from failures when progress stalls. To embody this capability, we introduce See, Plan, Rewind (SPR), a progress-aware vision-language-action framework that dynamically grounds language instructions into a sequence of spatial subgoals. SPR operates through a continuous core cycle, Seeing the current state and upcoming milestone, Planning a trajectory towards the next 2D waypoint, and Rewinding to a recoverable state upon failure by monitoring progress against the expected sequence. This closed-loop approach enables robust error correction without requiring additional training data or auxiliary models. Extensive experiments demonstrate the framework's effectiveness, generalization and robustness: SPR outperforms the MolmoAct baseline by 5% on the LIBERO benchmark. On the challenging LIBERO-Plus benchmark with unseen instructions and initial states, SPR achieves state-of-the-art robustness with the smallest performance drop, surpassing OpenVLA-OFT and UniVLA, demonstrating superior out-of-distribution robustness.
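
The See, Plan, Rewind cycle described in the abstract can be illustrated with a toy control loop. The sketch below is not the paper's implementation: it replaces the vision-language-action model with a 2D point agent, and every name in it (`run_progress_loop`, `greedy_step`, `tol`, `patience`, `max_steps`) is a hypothetical chosen for illustration. It only shows the general pattern of tracking progress against an ordered list of 2D subgoals and returning to the last recoverable state when progress stalls.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, float]


def distance(a: Point, b: Point) -> float:
    """Euclidean distance between two 2D points."""
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5


def greedy_step(state: Point, goal: Point, step: float = 0.1) -> Point:
    """Hypothetical planner: move at most `step` straight toward the goal."""
    d = distance(state, goal)
    if d <= step:
        return goal
    return (state[0] + step * (goal[0] - state[0]) / d,
            state[1] + step * (goal[1] - state[1]) / d)


def run_progress_loop(
    start: Point,
    subgoals: List[Point],
    step_fn: Callable[[Point, Point], Point],
    tol: float = 0.05,
    patience: int = 3,
    max_steps: int = 200,
) -> Tuple[bool, int]:
    """Toy see/plan/rewind loop. Returns (all_subgoals_reached, rewind_count)."""
    state = start
    checkpoint = start          # last state assumed to be recoverable
    idx = 0                     # index of the current subgoal
    stalled = 0                 # consecutive steps without progress
    rewinds = 0
    best = float("inf")         # best distance seen for the current subgoal

    for _ in range(max_steps):
        if idx == len(subgoals):
            break
        goal = subgoals[idx]

        # See: measure progress toward the current milestone.
        d = distance(state, goal)
        if d <= tol:
            idx += 1
            checkpoint = state  # a reached milestone becomes the new checkpoint
            stalled = 0
            best = float("inf")
            continue
        if d < best - 1e-9:
            best = d
            stalled = 0
        else:
            stalled += 1

        # Rewind: progress has stalled, so return to the last checkpoint.
        if stalled >= patience:
            state = checkpoint
            rewinds += 1
            stalled = 0
            best = float("inf")
            continue

        # Plan: take one step toward the next waypoint.
        state = step_fn(state, goal)

    return idx == len(subgoals), rewinds
```

Calling `run_progress_loop((0.0, 0.0), [(0.5, 0.0), (0.5, 0.5)], greedy_step)` runs the loop with a planner that never fails. Passing a `step_fn` that occasionally returns the state unchanged simulates a stalled manipulation and triggers the rewind branch. In this toy, the checkpoint is simply the state at the last reached subgoal; how SPR itself selects a recoverable state is not specified in the abstract.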

Multimodal Models | Robotics & Embodied AI | World Models & Planning

Citation Metrics

Citations: 0
Influential citations: 0
References: 40
Year: 2026
Venue: N/A
