Microsoft Research · Mar 29, 2026 · arXiv:2603.27670

ProgressVLA: Progress-Guided Diffusion Policy for Vision-Language Robotic Manipulation

AI Summary

This paper introduces ProgressVLA, a vision-language-action model for robotic manipulation that incorporates task progress estimation to improve performance on long-horizon tasks. The authors pre-train a progress estimator on large-scale, unsupervised video-text robotic datasets; it achieves a low prediction residual in simulation (0.07 on a [0, 1] scale) and transfers zero-shot to unseen real-world samples. An inverse dynamics world model maps predicted action tokens to future latent visual states, which the progress estimator then scores. Applying a maximal progress regularization to this pipeline yields differentiable progress guidance that refines the action tokens. The paper reports substantial improvements in success rates and generalization on the CALVIN and LIBERO benchmarks and in real-world deployment.

Key Contribution

ProgressVLA lets a manipulation policy estimate its own task progress and refine its actions toward higher predicted progress, which the paper reports improves success rates on long-horizon tasks with cascaded sub-goals.

Abstract

Most existing vision-language-action (VLA) models for robotic manipulation lack progress awareness, typically relying on hand-crafted heuristics for task termination. This limitation is particularly severe in long-horizon tasks involving cascaded sub-goals. In this work, we investigate the estimation and integration of task progress, proposing a novel model named ProgressVLA. Our technical contributions are twofold. (1) Robust progress estimation: we pre-train a progress estimator on large-scale, unsupervised video-text robotic datasets. This estimator achieves a low prediction residual (0.07 on a scale of [0, 1]) in simulation and demonstrates zero-shot generalization to unseen real-world samples. (2) Differentiable progress guidance: we introduce an inverse dynamics world model that maps predicted action tokens into future latent visual states. These latents are then processed by the progress estimator; by applying a maximal progress regularization, we establish a differentiable pipeline that provides progress-piloted guidance to refine action tokens. Extensive experiments on the CALVIN and LIBERO benchmarks, alongside real-world robot deployment, consistently demonstrate substantial improvements in success rates and generalization over strong baselines.
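
The guidance idea in the abstract (action, then predicted future latent, then progress score, then a gradient back to the action) can be illustrated with a toy sketch. This is not the paper's implementation. The linear `world_model`, the logistic `progress` estimator, the hand-derived gradient, and all parameter values are hypothetical stand-ins chosen so the example runs with the Python standard library alone. Plain gradient ascent is used here in place of whatever guidance scheme the diffusion policy actually applies.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def world_model(z, a, W):
    """Toy stand-in for the world model: next latent = z + W @ a."""
    return [z_i + sum(w_ij * a_j for w_ij, a_j in zip(row, a))
            for z_i, row in zip(z, W)]


def progress(z, w, b):
    """Toy progress estimator: maps a latent state to a scalar in (0, 1)."""
    return sigmoid(sum(w_i * z_i for w_i, z_i in zip(w, z)) + b)


def refine_action(z, a, W, w, b, step_size=0.5, n_steps=10):
    """Refine an action by gradient ascent on predicted progress."""
    a = list(a)
    for _ in range(n_steps):
        p = progress(world_model(z, a, W), w, b)
        # Chain rule for this toy model:
        # dp/da_j = p * (1 - p) * sum_i w[i] * W[i][j]
        scale = p * (1.0 - p)
        grad = [scale * sum(w[i] * W[i][j] for i in range(len(w)))
                for j in range(len(a))]
        a = [a_j + step_size * g_j for a_j, g_j in zip(a, grad)]
    return a


# Hypothetical 2-D latent state and 3-D action, for illustration only.
z0 = [0.0, 0.0]
a0 = [0.0, 0.0, 0.0]
W = [[1.0, 0.0, 0.5],
     [0.0, 1.0, -0.5]]
w = [1.0, -1.0]
b = -1.0

before = progress(world_model(z0, a0, W), w, b)
a1 = refine_action(z0, a0, W, w, b)
after = progress(world_model(z0, a1, W), w, b)
print(f"predicted progress before: {before:.3f}, after: {after:.3f}")
```

Because every stage in the sketch is differentiable, the progress score can push the action directly. In a real system the gradient would presumably come from automatic differentiation through learned networks rather than a closed-form expression.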

Computer Vision · Multimodal Models · Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
