The paper introduces a Mean-Flow-based One-Step Vision-Language-Action (VLA) approach to address the generation latency of Flow-Matching-based VLA frameworks for robotic manipulation. By resolving noise-induced issues in action generation, the method eliminates the consistency constraints of conventional Flow-Matching and enables one-step action generation. In real-world robotic experiments, the proposed method generates actions 8.7 times faster than SmolVLA and 83.9 times faster than Diffusion Policy, making it a promising high-efficiency backbone for VLA-based robotic manipulation.
Ditch slow, iterative sampling: a new Mean-Flow method generates actions in a single step, up to 84x faster than Diffusion Policy, for vision-language-action robotic control.
Recent advances in Flow-Matching-based Vision-Language-Action (VLA) frameworks have demonstrated remarkable advantages in generating high-frequency action chunks, particularly for highly dexterous robotic manipulation tasks. Despite these achievements, their practical application is constrained by prolonged generation latency, which stems from inherent iterative sampling requirements and architectural limitations. To address this bottleneck, we propose a Mean-Flow-based One-Step VLA approach. Specifically, we resolve the noise-induced issues in the action generation process, thereby eliminating the consistency constraints inherent to conventional Flow-Matching methods. This significantly improves generation efficiency and enables one-step action generation. Real-world robotic experiments show that the proposed Mean-Flow-based One-Step VLA generates actions 8.7 times faster than SmolVLA and 83.9 times faster than Diffusion Policy. These results indicate its potential as a high-efficiency backbone for VLA-based robotic manipulation.
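The contrast between iterative sampling and one-step generation can be illustrated with a toy sketch. This is not the paper's model. It assumes the common convention where t = 1 is pure noise and t = 0 is the action, a straight-line path z_t = (1 - t) a + t * noise, and a single fixed target action so that both velocity fields have a closed form. The names `TARGET_ACTION`, `instantaneous_velocity`, `average_velocity`, `sample_iterative`, and `sample_one_step` are hypothetical stand-ins for a learned network and its samplers.

```python
import random

# Hypothetical 3-DoF target action. In a real VLA this would come from a
# policy conditioned on images and language, not from a constant.
TARGET_ACTION = [0.2, -0.5, 1.0]


def counted(fn):
    """Wrap fn so the number of 'network' evaluations can be inspected."""
    def wrapper(*args):
        wrapper.calls += 1
        return fn(*args)
    wrapper.calls = 0
    return wrapper


def instantaneous_velocity(z, t):
    # Toy closed form for a single target a on the assumed straight-line path:
    # v(z, t) = (z - a) / t.
    return [(zi - ai) / t for zi, ai in zip(z, TARGET_ACTION)]


def average_velocity(z, r, t):
    # On this toy path the velocity is constant along each trajectory, so its
    # average over [r, t] equals the instantaneous value at (z, t).
    return [(zi - ai) / t for zi, ai in zip(z, TARGET_ACTION)]


def sample_iterative(velocity, noise, num_steps):
    """Euler integration from t = 1 (noise) down to t = 0 (action)."""
    z = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = 1.0 - i * dt
        v = velocity(z, t)  # one network evaluation per step
        z = [zi - dt * vi for zi, vi in zip(z, v)]
    return z


def sample_one_step(avg_velocity, noise):
    """Single jump over the whole interval: z_0 = z_1 - u(z_1, 0, 1)."""
    u = avg_velocity(noise, 0.0, 1.0)  # the only network evaluation
    return [zi - ui for zi, ui in zip(noise, u)]


if __name__ == "__main__":
    rng = random.Random(0)
    noise = [rng.gauss(0.0, 1.0) for _ in TARGET_ACTION]
    v = counted(instantaneous_velocity)
    u = counted(average_velocity)
    print(sample_iterative(v, noise, 10), "evaluations:", v.calls)
    print(sample_one_step(u, noise), "evaluations:", u.calls)
```

In this toy setting both samplers land on the same action, but the iterative sampler evaluates the velocity function once per step while the one-step sampler evaluates it once in total. With a large network in place of the closed forms, that evaluation count is what drives generation latency.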