The paper introduces LinkVLA, a Vision-Language-Action model for autonomous driving designed to improve the alignment between language instructions and action outputs while making action generation more efficient. LinkVLA unifies language and action tokens within a shared discrete codebook and adds an auxiliary action understanding objective that generates descriptive captions from trajectories, creating a bidirectional language-action mapping. The model decodes action sequences with a two-step coarse-to-fine generation method (C2F), which the authors report saves 86% of inference time while improving driving performance on closed-loop benchmarks.
LinkVLA tackles the language-action misalignment problem in autonomous driving by unifying language and action tokens in a shared discrete codebook, and it speeds up action generation with two-step coarse-to-fine decoding, leading to faster and more accurate instruction following.
Vision-Language-Action (VLA) models are emerging as a promising paradigm for end-to-end autonomous driving, valued for their potential to leverage world knowledge and reason about complex driving scenes. However, existing methods suffer from two critical limitations: a persistent misalignment between language instructions and action outputs, and the inherent inefficiency of typical auto-regressive action generation. In this paper, we introduce LinkVLA, a novel architecture that directly addresses these challenges to enhance both alignment and efficiency. First, we establish a structural link by unifying language and action tokens into a shared discrete codebook, processed within a single multi-modal model. This structurally enforces cross-modal consistency from the ground up. Second, to create a deep semantic link, we introduce an auxiliary action understanding objective that trains the model to generate descriptive captions from trajectories, fostering a bidirectional language-action mapping. Finally, we replace the slow, step-by-step generation with a two-step coarse-to-fine generation method (C2F) that efficiently decodes the action sequence, saving 86% of inference time. Experiments on closed-loop driving benchmarks show consistent gains in instruction-following accuracy and driving performance, alongside reduced inference latency.
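As a rough illustration of what placing language and action tokens in one discrete codebook can look like, the sketch below quantises trajectory waypoints into bins and maps each bin to a token ID just past the end of a text vocabulary, so a single model can emit both kinds of token. This is not LinkVLA's actual tokenizer, which the abstract does not specify. The vocabulary size, bin count, coordinate range, and all function names are hypothetical values chosen for the example.

```python
from typing import List, Tuple

# All constants below are assumptions for illustration, not values from the paper.
TEXT_VOCAB_SIZE = 32000        # hypothetical size of the language vocabulary
NUM_BINS = 256                 # hypothetical number of bins per coordinate
COORD_RANGE = (-50.0, 50.0)    # hypothetical coordinate range in metres


def coord_to_token(value: float) -> int:
    """Quantise one coordinate into a bin and shift it past the text token IDs."""
    lo, hi = COORD_RANGE
    clipped = min(max(value, lo), hi)
    bin_idx = min(int((clipped - lo) / (hi - lo) * NUM_BINS), NUM_BINS - 1)
    return TEXT_VOCAB_SIZE + bin_idx


def token_to_coord(token: int) -> float:
    """Map an action token back to the centre of its bin."""
    lo, hi = COORD_RANGE
    bin_idx = token - TEXT_VOCAB_SIZE
    return lo + (bin_idx + 0.5) * (hi - lo) / NUM_BINS


def encode_trajectory(waypoints: List[Tuple[float, float]]) -> List[int]:
    """Flatten (x, y) waypoints into action tokens that share one ID space with text."""
    tokens: List[int] = []
    for x, y in waypoints:
        tokens.append(coord_to_token(x))
        tokens.append(coord_to_token(y))
    return tokens


def decode_trajectory(tokens: List[int]) -> List[Tuple[float, float]]:
    """Invert encode_trajectory, up to quantisation error."""
    coords = [token_to_coord(t) for t in tokens]
    return list(zip(coords[0::2], coords[1::2]))


if __name__ == "__main__":
    trajectory = [(0.0, 0.0), (1.5, -0.2), (3.1, -0.5)]
    action_tokens = encode_trajectory(trajectory)
    print(action_tokens)
    print(decode_trajectory(action_tokens))
```

Because the action IDs occupy a range disjoint from the text IDs, one output head over the combined vocabulary of `TEXT_VOCAB_SIZE + NUM_BINS` entries can score both modalities. For simplicity this sketch uses the same bins for x and y.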