Tsinghua AI, Jimei University, KAIST, NTU Taiwan, PKU, WHU
Jun 16, 2026
arXiv:2606.17536

OmniDrive: An LLM-Choreographed Multi-Agent World Model with Unified Latent Co-Compression for Multi-View Driving Video Generation

Zijie Meng, Yufei Liu, Chengqian Ma, Zhiyu Li, Jiyuan Liu, Wenhua Nie, Bingcai Wei, Shuqin Chen, Weichen Xu, Jiquan Yuan, Miao Zhang

AI Summary

This paper introduces DRIVE-CHOREO, a multi-agent world model that targets two problems in generative world models for autonomous driving: heterogeneous control injection and post-hoc cross-view fusion. Three Qwen2.5-VL agents (a Director, a Cartographer, and an Auditor) jointly author a single position-aware token sequence that aligns language, geometry, and pixels, and this sequence is co-compressed with the multi-view video by a 3-D VAE. On nuScenes, the model reports state-of-the-art multi-view consistency and BEV mAP (21.6) with an FVD of 45.7, and a detector trained purely on its synthetic data gains +2.4 NDS on the real validation split.

Key Contribution

DRIVE-CHOREO reports state-of-the-art multi-view consistency and BEV mAP (21.6) on nuScenes by having three LLM agents author a shared latent-token sequence that aligns language, geometry, and pixels.

Abstract

Generative world models for autonomous driving face two unresolved tensions: heterogeneous control injection, where free-form language, HD-maps, trajectories, and camera poses reside in incompatible representational spaces, and post-hoc cross-view fusion, where per-camera latents fail to encode global 3-D geometry. We trace both to a single root cause: the absence of a shared symbolic interlingua aligning language, geometry, and pixels at the latent-token level. We present DRIVE-CHOREO, an LLM-choreographed multi-agent world model that recasts controllable multi-view video generation as latent choreography. Three Qwen2.5-VL agents - a Director parsing user intent into a structured WorldScript, a Cartographer grounding it into spatially-anchored layout tokens, and an Auditor feeding cross-view critiques back as auxiliary supervision - jointly author a single position-aware token sequence. This sequence is co-compressed with the multi-view video via a view-time permutation that enforces inter-camera geometry within the convolutional receptive field of a 3-D VAE. On nuScenes, DRIVE-CHOREO sets new state-of-the-art multi-view consistency and BEV mAP (21.6) with competitive FVD (45.7); a detector trained purely on our synthetic data gains +2.4 NDS on the real validation split, validating downstream utility.
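The abstract does not spell out the view-time permutation, so the following is only a minimal sketch of one plausible reading: frames from all cameras are interleaved along a single sequence axis so that a 3-D convolution whose kernel spans that axis sees neighbouring cameras at the same timestep. The function names, the view-minor ordering, and the toy tensor sizes are assumptions made for illustration, not the authors' implementation; the only fixed fact is that nuScenes provides six cameras.

```python
import numpy as np

def view_time_interleave(video: np.ndarray) -> np.ndarray:
    """Hypothetical permutation: (V, T, H, W, C) -> (T*V, H, W, C).

    Frames are ordered time-major, view-minor, so index t*V + v holds
    camera v at timestep t. Adjacent entries along the new axis are then
    adjacent cameras at the same timestep.
    """
    v, t, h, w, c = video.shape
    # Move time in front of view, then merge the two axes.
    return video.transpose(1, 0, 2, 3, 4).reshape(t * v, h, w, c)

def view_time_deinterleave(seq: np.ndarray, num_views: int) -> np.ndarray:
    """Inverse of view_time_interleave: (T*V, H, W, C) -> (V, T, H, W, C)."""
    tv, h, w, c = seq.shape
    t = tv // num_views
    return seq.reshape(t, num_views, h, w, c).transpose(1, 0, 2, 3, 4)

# Toy example: 6 cameras (as in nuScenes), 4 timesteps, 2x2 single-channel frames.
video = np.arange(6 * 4 * 2 * 2 * 1).reshape(6, 4, 2, 2, 1)
seq = view_time_interleave(video)
restored = view_time_deinterleave(seq, num_views=6)
```

Under this assumed ordering, a 3-D convolution with kernel size k along the merged axis mixes up to k consecutive cameras at each step, which is one way inter-camera structure could fall inside the receptive field of a 3-D VAE encoder.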

Multimodal Models, World Models & Planning

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
