Case Western, Kuaishou, Li Auto, ZJU | Mar 2, 2026 | arXiv:2603.01441

Unifying Language-Action Understanding and Generation for Autonomous Driving

Xinyang Wang, Qian Liu, Qiang Liu, Wenjie Ding, Zhao Yang, Zhaorui Yang, Wei Li, Chang Liu, Chang Liu, Bailin Li, Kun Zhan, Kun Zhan, Xianpeng Lang, Xianpeng Lang, Wei Chen

AI Summary

The paper introduces LinkVLA, a Vision-Language-Action model for autonomous driving designed to improve the alignment between language instructions and action outputs while making action generation more efficient. LinkVLA unifies language and action tokens within a shared discrete codebook and adds an auxiliary action understanding objective that generates descriptive captions from trajectories, creating a bidirectional language-action mapping. A two-step coarse-to-fine generation method (C2F) decodes action sequences efficiently, cutting inference time by 86% and improving driving performance on closed-loop benchmarks.

Key Contribution

LinkVLA tackles the language-action misalignment problem in autonomous driving by unifying language and action tokens in a shared space, leading to faster and more accurate instruction following.
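One way to picture language and action tokens sharing a single discrete space is to quantize trajectory coordinates into bins and place those bin ids directly after the text vocabulary. The sketch below is illustrative only: uniform binning, the vocabulary size, the bin count, the coordinate range, and every function name are assumptions made for this example, and the paper's actual codebook may be constructed differently.

```python
# Illustrative sketch only: all constants and names below are hypothetical,
# not taken from the paper.
TEXT_VOCAB_SIZE = 32000             # assumed size of the language vocabulary
NUM_ACTION_BINS = 256               # assumed number of discrete action bins
COORD_MIN, COORD_MAX = -50.0, 50.0  # assumed coordinate range in metres


def coord_to_token(value):
    """Map one coordinate to a token id located after the text vocabulary."""
    clipped = min(max(value, COORD_MIN), COORD_MAX)
    frac = (clipped - COORD_MIN) / (COORD_MAX - COORD_MIN)
    return TEXT_VOCAB_SIZE + round(frac * (NUM_ACTION_BINS - 1))


def token_to_coord(token_id):
    """Invert coord_to_token, returning the value at the bin position."""
    bin_index = token_id - TEXT_VOCAB_SIZE
    return COORD_MIN + bin_index / (NUM_ACTION_BINS - 1) * (COORD_MAX - COORD_MIN)


def encode_trajectory(waypoints):
    """Flatten (x, y) waypoints into a sequence of action token ids."""
    return [coord_to_token(c) for point in waypoints for c in point]


def decode_trajectory(token_ids):
    """Rebuild (x, y) waypoints from a flat sequence of action token ids."""
    coords = [token_to_coord(t) for t in token_ids]
    return list(zip(coords[0::2], coords[1::2]))
```

Because action ids occupy their own range above the text ids, a single model can read and emit both kinds of token in one sequence. Decoding recovers each coordinate to within half a bin width under these assumed settings.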

Abstract

Vision-Language-Action (VLA) models are emerging as a promising paradigm for end-to-end autonomous driving, valued for their potential to leverage world knowledge and reason about complex driving scenes. However, existing methods suffer from two critical limitations: a persistent misalignment between language instructions and action outputs, and the inherent inefficiency of typical auto-regressive action generation. In this paper, we introduce LinkVLA, a novel architecture that directly addresses these challenges to enhance both alignment and efficiency. First, we establish a structural link by unifying language and action tokens into a shared discrete codebook, processed within a single multi-modal model. This structurally enforces cross-modal consistency from the ground up. Second, to create a deep semantic link, we introduce an auxiliary action understanding objective that trains the model to generate descriptive captions from trajectories, fostering a bidirectional language-action mapping. Finally, we replace the slow, step-by-step generation with a two-step coarse-to-fine generation method (C2F) that efficiently decodes the action sequence, reducing inference time by 86%. Experiments on closed-loop driving benchmarks show consistent gains in instruction-following accuracy and driving performance, alongside reduced inference latency.
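To put the 86% figure in more familiar terms, a fractional reduction in inference time can be converted into a speedup factor. The helper below is a small worked calculation, not code from the paper; the function name is hypothetical, and the conversion assumes the 86% refers to time per inference measured against the step-by-step baseline.

```python
def speedup_from_reduction(reduction):
    """Convert a fractional latency reduction (0 <= r < 1) into a speedup factor.

    If the new latency is (1 - r) times the old one, the speedup is 1 / (1 - r).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


# An 86% reduction leaves 14% of the baseline time, i.e. roughly a 7x speedup.
print(round(speedup_from_reduction(0.86), 2))  # 7.14
```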

Architecture Design (Transformers, SSMs, MoE) | Multimodal Models | Robotics & Embodied AI

Citation Metrics

Citations: 0
Influential citations: 0
References: 61
Year: 2026
Venue: N/A
