This paper introduces DRIVE-CHOREO, a multi-agent world model that addresses two problems in generative world models for autonomous driving: heterogeneous control injection and post-hoc cross-view fusion. Three Qwen2.5-VL agents collaborate to author a single position-aware latent-token sequence that aligns language, geometry, and pixels. On nuScenes, the model reports state-of-the-art multi-view consistency and BEV mAP (21.6) with competitive FVD (45.7), and a detector trained purely on its synthetic data gains +2.4 NDS on the real validation split.
DRIVE-CHOREO achieves state-of-the-art multi-view consistency and BEV mAP on nuScenes by choreographing a shared latent-token sequence across language, geometry, and pixels.
Generative world models for autonomous driving face two unresolved tensions: heterogeneous control injection, where free-form language, HD-maps, trajectories, and camera poses reside in incompatible representational spaces, and post-hoc cross-view fusion, where per-camera latents fail to encode global 3-D geometry. We trace both to a single root cause: the absence of a shared symbolic interlingua aligning language, geometry, and pixels at the latent-token level. We present DRIVE-CHOREO, an LLM-choreographed multi-agent world model that recasts controllable multi-view video generation as latent choreography. Three Qwen2.5-VL agents - a Director parsing user intent into a structured WorldScript, a Cartographer grounding it into spatially-anchored layout tokens, and an Auditor feeding cross-view critiques back as auxiliary supervision - jointly author a single position-aware token sequence. This sequence is co-compressed with the multi-view video via a view-time permutation that enforces inter-camera geometry within the convolutional receptive field of a 3-D VAE. On nuScenes, DRIVE-CHOREO sets new state-of-the-art multi-view consistency and BEV mAP (21.6) with competitive FVD (45.7); a detector trained purely on our synthetic data gains +2.4 NDS on the real validation split, validating downstream utility.
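The abstract does not spell out the view-time permutation, so the following is only a minimal sketch of one plausible reading: camera views are interleaved along the temporal axis so that a 3-D convolution kernel sliding over that axis covers neighbouring cameras at the same timestep. The function names, the `(V, T, C, H, W)` input layout, and the interleaving order are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def view_time_permute(video: np.ndarray) -> np.ndarray:
    """Interleave camera views along the temporal (depth) axis.

    Hypothetical layout, assumed for illustration:
      input  : (V, T, C, H, W)  -- V cameras, T timesteps
      output : (C, T * V, H, W) -- depth index d = t * V + v
    All views of timestep t are adjacent in depth, so a 3-D conv
    kernel with depth extent > 1 mixes features across cameras.
    """
    V, T, C, H, W = video.shape
    # (V, T, C, H, W) -> (T, V, C, H, W): group the views of each timestep
    x = video.transpose(1, 0, 2, 3, 4)
    # Merge (T, V) into one depth axis: (T * V, C, H, W)
    x = x.reshape(T * V, C, H, W)
    # Channels-first layout commonly used by 3-D convolutions: (C, D, H, W)
    return x.transpose(1, 0, 2, 3)


def inverse_view_time_permute(x: np.ndarray, num_views: int) -> np.ndarray:
    """Undo view_time_permute, returning an array of shape (V, T, C, H, W)."""
    C, D, H, W = x.shape
    if D % num_views != 0:
        raise ValueError("depth axis is not divisible by num_views")
    T = D // num_views
    # (C, D, H, W) -> (D, C, H, W) -> (T, V, C, H, W)
    x = x.transpose(1, 0, 2, 3).reshape(T, num_views, C, H, W)
    # (T, V, C, H, W) -> (V, T, C, H, W)
    return x.transpose(1, 0, 2, 3, 4)


if __name__ == "__main__":
    # nuScenes provides 6 surround cameras; the other sizes are arbitrary.
    V, T, C, H, W = 6, 4, 3, 8, 8
    clip = np.random.rand(V, T, C, H, W).astype(np.float32)
    permuted = view_time_permute(clip)
    print(permuted.shape)
```

In this layout, a kernel of depth 3 centred on camera `v` at timestep `t` also covers cameras `v - 1` and `v + 1` at the same timestep, which is one way inter-camera structure could fall inside the convolutional receptive field. The inverse function restores the per-camera layout after decoding.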