The paper introduces Action-Draft-and-Verify (ADV), a novel framework for Vision-Language-Action (VLA) models that combines the strengths of diffusion-based action generation and auto-regressive verification. ADV uses a diffusion action expert to generate multiple candidate action chunks, which are then scored and selected by a VLM using a perplexity-style metric in a single forward pass. Experiments demonstrate that ADV significantly improves success rates compared to a diffusion-based baseline, achieving +4.3 points in simulation and +19.7 points in real-world environments.
Ditch the diffusion vs. autoregressive debate: this VLA framework uses diffusion to *draft* actions and an autoregressive model to *verify* them, boosting real-world success by nearly 20%.
Vision-Language-Action (VLA) models have recently demonstrated strong performance across embodied tasks. Modern VLAs commonly employ diffusion action experts to efficiently generate high-precision continuous action chunks, while auto-regressive generation can be slower and less accurate at low-level control. Yet auto-regressive paradigms still provide complementary priors that can improve robustness and generalization in out-of-distribution environments. To leverage both paradigms, we propose Action-Draft-and-Verify (ADV): a diffusion action expert drafts multiple candidate action chunks, and the VLM selects one by scoring all candidates in a single forward pass with a perplexity-style metric. Under matched backbones, training data, and action-chunk length, ADV improves success rate by +4.3 points in simulation and +19.7 points in real-world settings over the diffusion-based baseline, at the cost of only a single additional VLM forward pass for reranking.
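The verification step described above can be sketched as perplexity-based reranking. This is a minimal illustration, not the paper's implementation: it assumes the VLM exposes per-token log-probabilities for each drafted action chunk (scored together in one batched forward pass), and the function names are hypothetical.

```python
import math

def perplexity(token_logprobs):
    """Perplexity-style score: exp of the negative mean token log-probability."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def select_chunk(candidate_logprobs):
    """Pick the drafted action chunk the verifier finds most likely.

    candidate_logprobs: one list of token log-probs per candidate chunk,
    e.g. obtained by scoring all K drafts in a single batched VLM forward
    pass (hypothetical interface; ADV's exact metric may differ).
    Returns the index of the candidate with the lowest perplexity.
    """
    scores = [perplexity(lp) for lp in candidate_logprobs]
    return min(range(len(scores)), key=scores.__getitem__)

# Toy example: three drafted chunks; the second is assigned the
# highest likelihood by the verifier, so it is selected.
drafts = [[-2.0, -1.5], [-0.3, -0.4], [-1.0, -2.5]]
print(select_chunk(drafts))  # → 1
```

Because all candidates are scored in one pass, the verification overhead stays roughly constant in the number of drafts, which is what makes the draft-and-verify loop cheap relative to the diffusion sampling itself.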