Jiahao Gong

Afari Intelligent Drive

Abstract

End-to-end autonomous driving aims to generate safe and plausible planning policies from raw sensor input, and constructing an effective scene representation is a critical challenge. Driving world models have shown great potential in learning rich representations by predicting the future evolution of a driving scene. However, existing driving world models primarily focus on visual scene representation, and motion representation is not explicitly designed to be planner-shared and inheritable, leaving a schism between the optimization of visual scene generation and the requirements of precise motion planning. We present WorldDrive, a holistic framework that couples scene generation and real-time planning via unifying vision and motion representation. We first introduce a Trajectory-aware Driving World Model, which conditions on a trajectory vocabulary to enforce consistency between visual dynamics and motion intentions, enabling the generation of diverse and plausible future scenes conditioned on a specific trajectory. We transfer the vision and motion encoders to a downstream Multi-modal Planner, ensuring the driving policy operates on mature representations pre-optimized by scene generation. A simple interaction between motion representation, visual representation, and ego status can generate high-quality, multi-modal trajectories. Furthermore, to exploit the world model’s foresight, we propose a Future-aware Rewarder, which distills future latent representation from the frozen world model to evaluate and select optimal trajectories in real-time. Extensive experiments on the NAVSIM, NAVSIM-v2, and nuScenes benchmarks demonstrate that WorldDrive achieves leading planning performance among vision-only methods while maintaining high-fidelity action-controlled video generation capabilities, providing strong evidence for the effectiveness of unifying vision and motion representation for robust autonomous driving.
The code is available at https://github.com/TabGuigui/WorldDrive.

1 Introduction

End-to-end autonomous driving aspires to learn direct sensor-to-action policies [4, 7, 43], which hinges on effective visual representation learning [21, 42, 60, 38]. Propelled by recent breakthroughs in generative modeling, Driving World Models (DWMs) are emerging as a promising paradigm for autonomous driving [10, 13, 35]. By explicitly modeling future scene evolution, DWMs provide a powerful foundation for forecasting complex driving environments and for downstream tasks.

Figure 1: World models for end-to-end autonomous driving. (a) Planning with future scenes generated by a driving world model. (b) Planning with semantic representation extracted from a latent world model. (c) WorldDrive bridges planning and driving world model via unifying vision and motion representation.

Leveraging this predictive capability, the representation learned by DWMs holds immense potential for end-to-end planning. Current integration strategies generally follow two main paradigms. One approach leverages high-fidelity models to generate future scenes for downstream planners [48, 56, 59, 27]. However, this process incurs prohibitive computational costs. To mitigate this overhead, a second paradigm operates entirely within a latent world model [29, 61, 53]. Although more efficient, this approach sacrifices interpretable scene simulation and limits explicit visual verification that is valuable for safety-critical decision-making.

Crucially, we identify a systemic limitation pervading both strategies: representation misalignment and task disconnection. Scene simulators are typically optimized for perceptual reconstruction, emphasizing vision representation, whereas planners are trained in isolation for action regression to encode motion representation.
This lack of a unified representation, one that is shared and consistent across both scene generation and planning tasks, prevents the planner from fully leveraging the generative driving world model’s learned dynamics and motion priors.

To bridge this gap, we introduce WorldDrive, a holistic framework designed to synergize end-to-end planning with scene generation via unifying vision and motion representation. The core philosophy of WorldDrive is representation unification: we posit that the latent features capable of generating the future (scene generation) should be the same features used to decide the future (planning).

To this end, we first propose a Trajectory-aware Driving World Model (TA-DWM). Unlike prior works that treat motion as a superficial condition, TA-DWM employs a multi-modal trajectory encoding scheme built on predefined trajectory anchors to construct a structured latent space where visual dynamics are intrinsically coupled with motion intentions. This design enables representation inheritance: the robust vision and motion encoders learned by the TA-DWM are directly transferred to initialize the downstream planner. This ensures that the planner operates in a mature and consistent latent space pre-aligned by the future scene generation task.

Building upon this unified representation, we design a lightweight Multi-modal Planner. Leveraging these frozen encoders, the planner uses a query-centric cross-attention mechanism to efficiently fuse historical visual context with structured motion priors, generating diverse and high-quality trajectory candidates.

To harness the predictive foresight of the world model while avoiding the high latency of explicit video generation, we further introduce the Future-aware Rewarder (FAR). Although TA-DWM is capable of synthesizing future videos, executing this generative process for every trajectory candidate is computationally prohibitive.
Instead, the FAR employs a planning-oriented distillation mechanism to directly distill the future latents from the frozen world model. This approach enables WorldDrive to evaluate candidate trajectories based on their corresponding distilled future latent. This effectively aligns the planner’s selection with the world model’s learned dynamics while maintaining real-time inference speeds suitable for onboard deployment.

The main contributions of our work can be summarized as follows:

• We introduce a novel trajectory-aware driving world model (TA-DWM) that unifies the vision and motion representation. This design enables action-controllable future scene generation, producing plausible futures that are physically consistent with the conditioning trajectory.
• Leveraging the powerful representation learned by TA-DWM, we design a lightweight multi-modal planner that effectively fuses vision and motion cues. We also introduce a future-aware rewarder module that leverages the TA-DWM’s foresight without the latency of explicit video generation, enabling real-time re-scoring of trajectory candidates at inference.
• We integrate these components into WorldDrive, a holistic framework that bridges the representational schism between visual simulation and trajectory planning. Its design enables two capabilities: high-fidelity, motion-consistent future scene generation, and multi-modal, real-time planning.
• We provide extensive evidence that WorldDrive achieves state-of-the-art WM-based end-to-end planning performance on NAVSIM, NAVSIM-v2, and nuScenes while concurrently achieving high-fidelity performance on conditional scene generation tasks.

2 Related Works

2.1 End-to-end Autonomous Driving

End-to-end autonomous driving aims to directly generate motion planning from raw sensor inputs and has attracted increasing research attention [4, 7, 16, 20].
Most end-to-end autonomous driving systems follow a general framework that integrates perception, prediction, and planning in a cascaded [21, 55] or parallel manner [49, 8] with a structured BEV feature [33, 34, 36, 14]. To reduce the reliance on dense BEV features, several works have proposed the use of sparse representations [24, 57, 42, 23]. Leveraging driving priors, VADv2 [5] introduced a trajectory vocabulary as a prior for trajectory sampling. Building on this, Hydra-MDP [32] further distilled rule-based information. Additionally, DiffusionDrive [36] proposed a trajectory diffusion framework based on the trajectory vocabulary to accelerate the denoising process, and WoTE [31] utilized the trajectory anchor and future BEV state prediction to enhance the driving performance.

Figure 2: Overall architecture of WorldDrive. WorldDrive is a holistic framework unifying vision and motion representation to bridge scene generation and planning. The training process includes Phase 1: WorldDrive for scene generation and Phase 2: WorldDrive for motion planning. The vision and motion representations are optimized through the scene generation task. In the planning stage, the planner utilizes the frozen vision and trajectory encoders and outputs top-K multi-modal trajectories. A future-aware rewarder is further designed to select the optimal trajectory from the candidates.

2.2 Driving World Models

Driving world models aim to predict the scene evolution from observations [10, 13, 44, 25] and have shown great potential for generating long-horizon [11, 12, 15], high-fidelity [47, 26, 50], and corner-case data [40, 46, 41]. Although these methods are capable of generating driving scenarios, driving world models need to simulate reasonable scenarios based on motion conditions [22, 19, 52, 17]. Vista [12] achieved strong dynamic modeling by utilizing a larger volume of driving scenario data and introducing a motion loss.
Building on this, DriVerse [28] realized trajectory-specific video generation by encoding trajectories as textual prompts and motion priors, and incorporated a motion alignment module. ReSim [51] enriched real-world human demonstrations with diverse non-expert data collected from a driving simulator. DrivingGPT [6] and Epona [58] introduced a discrete action representation at each timestep on top of an autoregressive video generation framework, enabling controllable trajectory generation.

2.3 World Model for Planning

Research on optimizing planning policy using the world model has begun to show potential due to the strong representation capability [10, 13, 35]. Drive-WM [48] was the first work to introduce the driving world model into end-to-end planning. It used predicted trajectories as conditions for future scenario prediction and combined perception modules to evaluate scenario safety. FSDrive [56] proposed a visual chain-of-thought pipeline for future scene generation and planning. DrivingGPT [6] and Epona [58] used discrete and continuous autoregressive models to unify the generation of driving scenarios and driving trajectories, respectively. PWM [59] and DriveVLA-W0 [30] coupled trajectory planning with world simulation via action-free future state forecasting. Considering the high cost and latency of scene generation, latent world model methods, such as LAW [29], World
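
As an aside on the planning pipeline outlined in the introduction, the two-stage selection (the planner proposes top-K multi-modal trajectories, then a rewarder re-scores them and picks one) can be pictured with a minimal sketch. Everything named here is an assumption for illustration: `Candidate`, `top_k`, `select_with_rewarder`, and `toy_reward` are hypothetical and do not come from the WorldDrive code, and the hand-written reward merely stands in for the learned Future-aware Rewarder, which per the text scores each candidate from its distilled future latent.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[float, float]  # (x, y) ego waypoint, illustrative units


@dataclass
class Candidate:
    """One trajectory proposal (hypothetical container, not the paper's API)."""
    trajectory: List[Point]  # future ego waypoints
    planner_score: float     # confidence assigned by the multi-modal planner


def top_k(candidates: List[Candidate], k: int) -> List[Candidate]:
    """Keep the k proposals the planner is most confident about."""
    return sorted(candidates, key=lambda c: c.planner_score, reverse=True)[:k]


def select_with_rewarder(
    candidates: List[Candidate],
    k: int,
    reward_fn: Callable[[Candidate], float],
) -> Candidate:
    """Re-score the planner's top-k shortlist and return the highest-reward one."""
    shortlist = top_k(candidates, k)
    return max(shortlist, key=reward_fn)


def toy_reward(candidate: Candidate) -> float:
    """Placeholder reward: penalize the largest lateral offset.

    Assumption: a learned rewarder would replace this hand-written rule.
    """
    return -max(abs(y) for _, y in candidate.trajectory)


candidates = [
    Candidate([(1.0, 0.0), (2.0, 0.0)], planner_score=0.5),  # straight
    Candidate([(1.0, 0.5), (2.0, 1.5)], planner_score=0.9),  # sharp swerve
    Candidate([(1.0, 0.1), (2.0, 0.2)], planner_score=0.7),  # slight drift
    Candidate([(1.0, 0.0), (2.0, 0.0)], planner_score=0.1),  # low-confidence straight
]

best = select_with_rewarder(candidates, k=3, reward_fn=toy_reward)
print(best.trajectory)
```

The point of the sketch is the division of labor: the planner's own confidence only determines which candidates reach the shortlist, while the final choice is delegated to a separate scoring function, so the selected trajectory need not be the planner's top-ranked one.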

Research focus

Computer Vision (1), Robotics & Embodied AI (1), World Models & Planning (1)

Frequent co-authors

Xingtai Gui (1), Meijie Zhang (1), Jianbing Shen (1)

Papers (1)

Mar 16, 2026

Bridging Scene Generation and Planning: Driving with World Model via Unifying Vision and Motion Representation

WorldDrive achieves leading autonomous driving performance by unifying visual scene generation and motion planning, demonstrating that a shared representation space significantly improves both prediction accuracy and planning robustness.

Xingtai Gui, Meijie Zhang, Jiahao Gong +1
