Jun 4, 2026 | arXiv:2606.06491

TempoVLA: Learning Speed-Controllable Vision-Language-Action Policies

Dong Jing, Jingchen Nie, Tianqi Zhang, Jiaqi Liu, Huaxiu Yao, Zhiwu Lu, Mingyu Ding

AI Summary

This paper introduces TempoVLA, a Vision-Language-Action model whose execution speed during robotic manipulation is controlled by an explicit condition. It pairs a Variable-Speed Trajectory Augmentation (VSTA) technique, which re-times demonstration data to a target speed, with a conditioning mechanism that feeds that speed to the policy. According to the reported statistics, VSTA reaches the requested speed with negligible motion error, and it also improves performance at the default speed through better data utilization. Experiments in simulation and on real-world tasks show speed control in both directions, and pairing TempoVLA with a large multimodal model lets the robot speed up in low-risk phases and slow down in high-risk ones.

Key Contribution

A single VLA policy whose execution speed is set by an explicit condition, allowing fast execution in low-risk transit phases and slow, precise motion in high-risk contact stages.

Abstract

Robot manipulation alternates between low-risk transit phases that call for fast execution and high-risk contact stages that demand slow, precise motion. Yet existing Vision-Language-Action models (VLAs) only inherit a single fixed speed from training demonstrations. Prior efforts to accelerate VLAs through model compression, KV-cache reuse, or reinforcement learning only shift the policy from one fixed speed to another, and leave deceleration almost unexplored. We observe that the magnitude of each predicted action already governs how fast the robot moves, opening a direct route to controllable execution speed. We turn this observation into TempoVLA, a single VLA whose execution speed is controlled by an explicit condition. TempoVLA combines two coupled components: (1) a data-side Variable-Speed Trajectory Augmentation (VSTA) that re-times a demonstration to any target speed by merging or splitting actions while preserving its motion semantics, and (2) a model-side conditioning mechanism that feeds the speed to the policy. Statistics show that VSTA reaches the requested speed with negligible motion error. Experiments in simulation and on real-world tasks demonstrate that TempoVLA achieves flexible speed control in both directions, while VSTA additionally boosts the default $1\times$ performance via better data utilization. Furthermore, by cooperating with a large multimodal model, TempoVLA realizes dynamic speed control, accelerating through low-risk phases and decelerating for high-risk ones.
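The merge-or-split idea behind re-timing can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes each action is a relative displacement vector (for example, a delta end-effector position), ignores gripper and rotation channels, and uses linear interpolation of the cumulative path. The names `retime` and `position_at` are hypothetical.

```python
import math


def position_at(cumulative, t):
    """Linearly interpolate the cumulative displacement at fractional step t."""
    i = min(int(math.floor(t)), len(cumulative) - 2)
    frac = t - i
    return [a + frac * (b - a) for a, b in zip(cumulative[i], cumulative[i + 1])]


def retime(actions, speed):
    """Re-time a list of displacement actions to `speed` times the original pace.

    speed > 1 merges consecutive actions (fewer, larger steps);
    speed < 1 splits actions (more, smaller steps).
    The total displacement of the trajectory is preserved.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    if not actions:
        return []

    # Cumulative path: cumulative[i] is the displacement after i original steps.
    dim = len(actions[0])
    cumulative = [[0.0] * dim]
    for action in actions:
        cumulative.append([c + d for c, d in zip(cumulative[-1], action)])

    n = len(actions)
    # Small tolerance guards against float round-off in n / speed.
    n_out = math.ceil(n / speed - 1e-9)

    retimed = []
    for j in range(n_out):
        # Each new step covers `speed` original steps, clamped at the end.
        start = position_at(cumulative, min(j * speed, n))
        end = position_at(cumulative, min((j + 1) * speed, n))
        retimed.append([e - s for s, e in zip(start, end)])
    return retimed


demo = [[1.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.0, 2.0]]
fast = retime(demo, 2.0)   # pairs of actions merged into one
slow = retime(demo, 0.5)   # each action split into two halves
```

In this sketch, a speed of 2 turns four actions into two larger ones and a speed of 0.5 turns them into eight smaller ones, so the per-step action magnitude changes while the path stays the same. Fractional speeds such as 1.5 fall out of the same resampling, with the final step truncated at the end of the demonstration.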

Multimodal Models | Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
