DynVLA, a driving Vision-Language-Action (VLA) model, is introduced to forecast compact world dynamics before action generation, enabling more informed and physically grounded decision-making. It uses a Dynamics Tokenizer to compress future evolution into dynamics tokens and decouples ego-centric and environment-centric dynamics for accurate world modeling. Trained through supervised fine-tuning (SFT) and reinforcement fine-tuning (RFT) to generate dynamics tokens before actions, DynVLA demonstrates superior performance compared to Textual and Visual Chain-of-Thought (CoT) methods on NAVSIM, Bench2Drive, and an in-house dataset.
By forecasting compact world dynamics before acting, DynVLA outperforms Textual and Visual CoT methods, yielding more informed and physically grounded autonomous driving decisions.
We propose DynVLA, a driving VLA model that introduces a new CoT paradigm termed Dynamics CoT. DynVLA forecasts compact world dynamics before action generation, enabling more informed and physically grounded decision-making. To obtain compact dynamics representations, DynVLA introduces a Dynamics Tokenizer that compresses future scene evolution into a small set of dynamics tokens. To handle the rich environment dynamics of interaction-intensive driving scenarios, DynVLA decouples ego-centric and environment-centric dynamics, yielding more accurate world modeling. We then train DynVLA through SFT and RFT to generate dynamics tokens before actions, improving decision quality while keeping inference latency low. Compared with Textual CoT, which lacks fine-grained spatiotemporal understanding, and Visual CoT, which incurs substantial redundancy through dense image prediction, Dynamics CoT captures the evolution of the world in a compact, interpretable, and efficient form. Extensive experiments on NAVSIM, Bench2Drive, and a large-scale in-house dataset show that DynVLA consistently outperforms Textual CoT and Visual CoT methods, validating the effectiveness and practical value of Dynamics CoT.
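To make the Dynamics Tokenizer concrete, below is a minimal PyTorch sketch of how a horizon of future states could be compressed into a handful of discrete dynamics tokens with decoupled ego and environment streams. The abstract specifies only the tokenizer's role and the ego/environment decoupling; the GRU encoders, attention pooling, VQ-style codebook, and all dimensions here are illustrative assumptions, not the published architecture.

```python
import torch
import torch.nn as nn


class DynamicsTokenizer(nn.Module):
    """Compress future states into a few discrete dynamics tokens.

    Illustrative sketch only: the paper names the Dynamics Tokenizer and
    the ego/environment decoupling, but this particular encoder, pooling,
    and quantization design is an assumption.
    """

    def __init__(self, state_dim=8, token_dim=64, n_tokens=4, codebook_size=512):
        super().__init__()
        # Decoupled streams: ego motion and surrounding-environment dynamics
        # are encoded separately, following the abstract's description.
        self.ego_enc = nn.GRU(state_dim, token_dim, batch_first=True)
        self.env_enc = nn.GRU(state_dim, token_dim, batch_first=True)
        # Learnable queries pool the whole horizon into a fixed, small token set.
        self.queries = nn.Parameter(torch.randn(n_tokens, token_dim))
        self.pool = nn.MultiheadAttention(token_dim, num_heads=4, batch_first=True)
        # VQ-style codebook maps pooled features to discrete token ids that a
        # VLA can emit as a CoT prefix before its action tokens.
        self.codebook = nn.Embedding(codebook_size, token_dim)

    def forward(self, ego_future, env_future):
        # ego_future, env_future: (B, T, state_dim) ground-truth future states,
        # available at training time to produce supervision targets.
        ego_h, _ = self.ego_enc(ego_future)
        env_h, _ = self.env_enc(env_future)
        ctx = torch.cat([ego_h, env_h], dim=1)                    # (B, 2T, D)
        q = self.queries.unsqueeze(0).expand(ctx.size(0), -1, -1)
        pooled, _ = self.pool(q, ctx, ctx)                        # (B, n_tokens, D)
        # Nearest-neighbor quantization against the codebook.
        code = self.codebook.weight.unsqueeze(0).expand(pooled.size(0), -1, -1)
        dists = torch.cdist(pooled, code)                         # (B, n_tokens, K)
        return dists.argmin(dim=-1)                               # discrete token ids
```

Under this reading, the resulting ids would serve as next-token SFT targets preceding the action tokens, and at inference the model generates the dynamics prefix itself; keeping that prefix to a few tokens, rather than dense predicted images, is what makes Dynamics CoT latency-efficient relative to Visual CoT.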