DynVLA, a driving Vision-Language-Action (VLA) model, is introduced to forecast compact world dynamics before action generation, enabling more informed and physically grounded decision-making. It uses a Dynamics Tokenizer to compress future evolution into dynamics tokens and decouples ego-centric and environment-centric dynamics for accurate world modeling. Trained through supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) to generate dynamics tokens before actions, DynVLA demonstrates superior performance compared to Textual and Visual Chain-of-Thought (CoT) methods on NAVSIM, Bench2Drive, and an in-house dataset.
By forecasting compact world dynamics before acting, DynVLA outperforms Textual and Visual CoT methods, yielding more informed and physically grounded autonomous driving decisions.
We propose DynVLA, a driving VLA model that introduces a new CoT paradigm termed Dynamics CoT. DynVLA forecasts compact world dynamics before action generation, enabling more informed and physically grounded decision-making. To obtain compact dynamics representations, DynVLA introduces a Dynamics Tokenizer that compresses future scene evolution into a small set of dynamics tokens. To handle the rich environment dynamics of interaction-intensive driving scenarios, DynVLA decouples ego-centric and environment-centric dynamics, yielding more accurate world modeling. We then train DynVLA through SFT and RFT to generate dynamics tokens before actions, improving decision quality while keeping inference latency low. Compared with Textual CoT, which lacks fine-grained spatiotemporal understanding, and Visual CoT, which incurs substantial redundancy through dense image prediction, Dynamics CoT captures the evolution of the world in a compact, interpretable, and efficient form. Extensive experiments on NAVSIM, Bench2Drive, and a large-scale in-house dataset show that DynVLA consistently outperforms Textual CoT and Visual CoT methods, validating the effectiveness and practical value of Dynamics CoT.
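To make the Dynamics Tokenizer concrete, below is a minimal PyTorch sketch of how a horizon of future states could be compressed into a handful of discrete dynamics tokens with decoupled ego and environment streams. The abstract specifies only the tokenizer's role and the ego/environment decoupling; the GRU encoders, attention pooling, VQ-style codebook, and all dimensions here are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn


class DynamicsTokenizer(nn.Module):
    """Compress future states into a few discrete dynamics tokens.

    Illustrative sketch only: the paper names the Dynamics Tokenizer and
    the ego/environment decoupling, but this particular encoder, pooling,
    and quantization design is an assumption.
    """

    def __init__(self, state_dim=8, token_dim=64, n_tokens=4, codebook_size=512):
        super().__init__()
        # Decoupled streams: ego motion and surrounding-environment dynamics
        # are encoded separately, following the abstract's description.
        self.ego_enc = nn.GRU(state_dim, token_dim, batch_first=True)
        self.env_enc = nn.GRU(state_dim, token_dim, batch_first=True)
        # Learnable queries pool the whole horizon into a fixed, small token set.
        self.queries = nn.Parameter(torch.randn(n_tokens, token_dim))
        self.pool = nn.MultiheadAttention(token_dim, num_heads=4, batch_first=True)
        # VQ-style codebook maps pooled features to discrete token ids that a
        # VLA can emit as a CoT prefix before its action tokens.
        self.codebook = nn.Embedding(codebook_size, token_dim)

    def forward(self, ego_future, env_future):
        # ego_future, env_future: (B, T, state_dim) ground-truth future states,
        # available at training time to produce supervision targets.
        ego_h, _ = self.ego_enc(ego_future)
        env_h, _ = self.env_enc(env_future)
        ctx = torch.cat([ego_h, env_h], dim=1)                    # (B, 2T, D)
        q = self.queries.unsqueeze(0).expand(ctx.size(0), -1, -1)
        pooled, _ = self.pool(q, ctx, ctx)                        # (B, n_tokens, D)
        # Nearest-neighbor quantization against the codebook.
        code = self.codebook.weight.unsqueeze(0).expand(pooled.size(0), -1, -1)
        dists = torch.cdist(pooled, code)                         # (B, n_tokens, K)
        return dists.argmin(dim=-1)                               # discrete token ids
```

Under this reading, the resulting ids would serve as next-token SFT targets preceding the action tokens, and at inference the model generates the dynamics prefix itself; keeping that prefix to a few tokens, rather than dense predicted images, is what makes Dynamics CoT latency-efficient relative to Visual CoT.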