Tsinghua AI | Mar 30, 2026 | arXiv:2603.28565

StreamingVLA: Streaming Vision-Language-Action Model with Action Flow Matching and Adaptive Early Observation

Yiran Shi, Yi Shi, Dong Guo, Dongqi Guo, Tianchen Zhao, Feng Gao, Liangzhi Shi, Chaoyang Yu, Chao Yu, Zhijian Mo, Qihua Xiao, Qi Xiao, Xiaoshuai Peng, Qingmin Liao, Yu Wang

AI Summary

The paper introduces StreamingVLA, a novel vision-language-action model designed for efficient real-world deployment by enabling asynchronous parallelization across observation, action generation, and execution stages. This is achieved through action flow matching, which learns action trajectories instead of denoising chunk-wise actions, and an action saliency-aware adaptive observation mechanism. StreamingVLA significantly reduces latency (2.4x speedup) and execution halting (6.5x reduction) without sacrificing performance, making it more suitable for resource-constrained edge platforms.

Key Contribution

StreamingVLA achieves a 2.4x latency speedup and a 6.5x reduction in execution halting by asynchronously parallelizing the observation, action generation, and execution stages of vision-language-action models.

Abstract

Vision-language-action (VLA) models have demonstrated exceptional performance in natural language-driven perception and control. However, the high computational cost of VLA models poses significant efficiency challenges, particularly for resource-constrained edge platforms in real-world deployments. Moreover, because the stages of a VLA (observation, action generation, and execution) must proceed sequentially, each waiting for the preceding stage to complete, the system suffers from frequent halting and high latency. To address this, we conduct a systematic analysis to identify the challenges for fast and fluent generation, and propose enabling VLAs to asynchronously parallelize across stages in a "streaming" manner. First, we eliminate the reliance on action chunking and adopt action flow matching, which learns the trajectory of action flows rather than denoising chunk-wise actions; this overlaps the latency of action generation and execution. Second, we design an action saliency-aware adaptive observation mechanism, which overlaps the latency of execution and observation. Without sacrificing performance, StreamingVLA achieves a 2.4$\times$ latency speedup and reduces execution halting by 6.5$\times$, improving the fluency of execution.
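
To make the latency argument concrete, the following is a minimal toy model, not the paper's method or its measurements. It assumes fixed per-step stage latencies and an idealized pipeline in which all three stages overlap perfectly. The function names and the example timings are hypothetical and chosen only for illustration.

```python
# Toy latency model (illustrative assumption, not StreamingVLA's implementation).
# All times are integers in milliseconds; the stage latencies are hypothetical.

def sequential_time(n_steps: int, t_obs: int, t_gen: int, t_exec: int) -> int:
    """Total time when each stage waits for the previous one to finish."""
    return n_steps * (t_obs + t_gen + t_exec)


def pipelined_time(n_steps: int, t_obs: int, t_gen: int, t_exec: int) -> int:
    """Total time for an idealized three-stage pipeline.

    The first step pays the full latency of all stages (pipeline fill);
    each later step is bounded by the slowest single stage.
    """
    fill = t_obs + t_gen + t_exec
    return fill + (n_steps - 1) * max(t_obs, t_gen, t_exec)


def halting_time(total: int, n_steps: int, t_exec: int) -> int:
    """Time during which the robot is not executing any action."""
    return total - n_steps * t_exec


if __name__ == "__main__":
    # Hypothetical latencies: 20 ms observation, 50 ms generation, 40 ms execution.
    n, obs, gen, exe = 10, 20, 50, 40
    seq = sequential_time(n, obs, gen, exe)
    pipe = pipelined_time(n, obs, gen, exe)
    print("sequential total (ms):", seq)
    print("pipelined total (ms):", pipe)
    print("sequential halting (ms):", halting_time(seq, n, exe))
    print("pipelined halting (ms):", halting_time(pipe, n, exe))
```

In this idealized model, overlapping the stages shortens total time and cuts halting even further, because the time that remains is dominated by the slowest stage rather than by the sum of all three. The gains a real system sees depend on how completely the stages can actually be overlapped.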

Inference & Quantization | Multimodal Models | Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
