Search papers, labs, and topics across Lattice.
This paper introduces ProgressVLA, a vision-language-action model for robotic manipulation that incorporates task progress estimation to improve performance in long-horizon tasks. They pre-train a robust progress estimator on unsupervised video-text data, achieving low prediction error in simulation and zero-shot transfer to the real world. By integrating this estimator with an inverse dynamics world model and applying maximal progress regularization, ProgressVLA provides differentiable progress guidance to refine action tokens, leading to significant improvements in success rates and generalization on standard benchmarks and real-world deployment.
Robots can now better handle long, complex tasks by learning to estimate and maximize their progress towards a goal, leading to substantial performance gains.
Most existing vision-language-action (VLA) models for robotic manipulation lack progress awareness, typically relying on hand-crafted heuristics for task termination. This limitation is particularly severe in long-horizon tasks involving cascaded sub-goals. In this work, we investigate the estimation and integration of task progress, proposing a novel model named {\textbf \vla}. Our technical contributions are twofold: (1) \emph{robust progress estimation}: We pre-train a progress estimator on large-scale, unsupervised video-text robotic datasets. This estimator achieves a low prediction residual (0.07 on a scale of $[0, 1]$) in simulation and demonstrates zero-shot generalization to unseen real-world samples, and (2) \emph{differentiable progress guidance}: We introduce an inverse dynamics world model that maps predicted action tokens into future latent visual states. These latents are then processed by the progress estimator; by applying a maximal progress regularization, we establish a differentiable pipeline that provides progress-piloted guidance to refine action tokens. Extensive experiments on the CALVIN and LIBERO benchmarks, alongside real-world robot deployment, consistently demonstrate substantial improvements in success rates and generalization over strong baselines.