Fudan · Shanghai Innovation · USTC · Jun 4, 2026 · arXiv:2606.05737

Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models

Yitong Chen, Shiduo Zhang, Jingjing Gong, Xipeng Qiu

AI Summary

This paper challenges the conventional iterative-denoising approach in diffusion-based vision-language-action (VLA) models. It shows that effective one-step action generation can be obtained by simply biasing the training time distribution toward high-noise states, with no teacher model, distillation stage, or auxiliary objective. The authors first isolate the effect in a controlled MNIST grid-to-sequence task, then evaluate on standard LIBERO, LIBERO-Plus, and LIBERO-Pro, where one-step policies trained with high-noise biased schedules generally match ten-step decoding under the same recipe and, on standard LIBERO, can exceed ten-step policies trained with a uniform time distribution. With a 1.4B VLM and a 30M action head, one-step decoding reaches a 95.6% success rate on LIBERO-Long.

Key Contribution

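One-step action generation in VLA models can match ten-step decoding, and on standard LIBERO exceed ten-step policies trained with a uniform time distribution, simply by biasing the training time distribution toward high-noise states, without the few-step diffusion machinery developed for image generation.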

Abstract

Diffusion-based vision-language-action (VLA) models often inherit the image-generation view: actions are generated by iterative denoising. We argue that VLA action generation has a different condition-target structure: the policy is conditioned on rich observations, language, and state, but predicts only a compact, low-dimensional action chunk. Under this asymmetry, strong one-step action generation should not necessarily require the advanced one-step methods developed for image synthesis. We keep standard velocity prediction and add no teacher model, distillation stage, or auxiliary objective; in our main recipe, we simply bias the training time distribution toward high-noise states. We first isolate the effect in a controlled MNIST grid-to-sequence task, then test it with extensive robot-policy experiments. Across standard LIBERO, LIBERO-Plus, and LIBERO-Pro, one-step policies trained with high-noise biased schedules generally match ten-step decoding under the same recipe, and on standard LIBERO can exceed ten-step policies trained with a uniform time distribution. A real-robot bimanual YAM RSS evaluation gives a small-sample cross-architecture check of the same sampler trend. On a 1.4B VLM model with a 30M action head, one-step decoding reaches 95.6% on LIBERO-Long. These results show that strong one-step VLA action generation can emerge from standard diffusion training, without importing the full few-step diffusion machinery developed for image generation.
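To make the recipe in the abstract concrete, here is a minimal NumPy sketch of velocity-prediction training pairs with a high-noise biased time distribution, followed by one-step and ten-step Euler decoding. Everything specific in it is an assumption for illustration and is not taken from the paper: the convention that t = 0 is pure noise and t = 1 is the clean action, the linear interpolation path, the Beta(1, 3) time distribution, and the hypothetical names `sample_times`, `make_training_pair`, `decode`, and `oracle_velocity`. The oracle stands in for a trained network in the idealized case where the condition fully determines the action.

```python
import numpy as np


def sample_times(rng, n, high_noise_bias=True):
    """Sample training times in [0, 1].

    Assumed convention: t=0 is pure noise, t=1 is the clean action.
    """
    if high_noise_bias:
        # Beta(1, 3) concentrates mass near t=0 (high noise).
        # The shape parameters are placeholders, not the paper's values.
        return rng.beta(1.0, 3.0, size=n)
    return rng.uniform(0.0, 1.0, size=n)


def make_training_pair(rng, actions, t):
    """Build the noisy input x_t and the velocity regression target."""
    noise = rng.standard_normal(actions.shape)
    t = t[:, None]                          # broadcast over action dims
    x_t = (1.0 - t) * noise + t * actions   # linear path from noise to action
    target_velocity = actions - noise       # constant velocity along the path
    return x_t, target_velocity


def decode(velocity_fn, noise, num_steps):
    """Euler integration from t=0 to t=1; num_steps=1 is one-step decoding."""
    x = noise.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + dt * velocity_fn(x, t)
    return x


def oracle_velocity(target):
    """Ideal velocity field when the condition fully determines the action.

    This is an illustrative stand-in for a trained network.
    """
    def fn(x, t):
        return (target - x) / (1.0 - t)
    return fn


rng = np.random.default_rng(0)
t_biased = sample_times(rng, 10_000, high_noise_bias=True)
t_uniform = sample_times(rng, 10_000, high_noise_bias=False)

actions = rng.standard_normal((4, 7))       # 4 toy action vectors of dim 7
x_t, v_target = make_training_pair(rng, actions, t_biased[:4])

noise = rng.standard_normal(actions.shape)
one_step = decode(oracle_velocity(actions), noise, num_steps=1)
ten_step = decode(oracle_velocity(actions), noise, num_steps=10)
```

In this idealized setting the velocity at t = 0 already points straight at the target, so one Euler step and ten Euler steps land on the same action. A learned policy only approximates that field, so the sketch illustrates the sampler mechanics rather than reproducing any reported result.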

Multimodal Models · Tool Use & Agents

