Dec 15, 2025 · arXiv:2512.13177

MMDrive: Interactive Scene Understanding Beyond Vision with Multi-representational Fusion

Minghui Hou, Wei-Hsing Huang, Shaofeng Liang, Daizong Liu, Tai-Hao Wen, Gang Wang, Runwei Guan, Weiping Ding

AI Summary

The paper introduces MMDrive, a multimodal vision-language model for autonomous driving that extends beyond 2D image understanding to incorporate 3D scene information from occupancy maps, LiDAR point clouds, and textual descriptions. MMDrive employs a Text-oriented Multimodal Modulator for dynamically weighting modalities based on question semantics and a Cross-Modal Abstractor to generate compact, cross-modal summaries. Experiments on DriveLM and NuScenes-QA benchmarks demonstrate that MMDrive significantly outperforms existing vision-language models, achieving improved BLEU-4, METEOR, and accuracy scores.

Key Contribution

MMDrive extends autonomous driving scene understanding from 2D images to 3D by fusing occupancy maps, LiDAR point clouds, and textual scene descriptions, and it outperforms existing vision-language models on the DriveLM and NuScenes-QA benchmarks.

Abstract

Vision-language models enable the understanding of and reasoning about complex traffic scenarios through multi-source information fusion, establishing them as a core technology for autonomous driving. However, existing vision-language models are constrained by the image understanding paradigm in the 2D plane, which restricts their capability to perceive 3D spatial information and perform deep semantic fusion, resulting in suboptimal performance in complex autonomous driving environments. This study proposes MMDrive, a multimodal vision-language model framework that extends traditional image understanding to generalized 3D scene understanding. MMDrive incorporates three complementary modalities: occupancy maps, LiDAR point clouds, and textual scene descriptions. To this end, it introduces two novel components for adaptive cross-modal fusion and key information extraction. Specifically, the Text-oriented Multimodal Modulator dynamically weights the contributions of each modality based on the semantic cues in the question, guiding context-aware feature integration. The Cross-Modal Abstractor employs learnable abstract tokens to generate compact, cross-modal summaries that highlight key regions and essential semantics. Comprehensive evaluations on the DriveLM and NuScenes-QA benchmarks demonstrate that MMDrive achieves significant performance gains over existing vision-language models for autonomous driving, with a BLEU-4 score of 54.56 and a METEOR score of 41.78 on DriveLM, and an accuracy of 62.7% on NuScenes-QA. MMDrive effectively breaks the traditional image-only understanding barrier, enabling robust multimodal reasoning in complex driving environments and providing a new foundation for interpretable autonomous driving scene understanding.
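
The two components named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no architectural details, so the softmax gating, the scaled dot-product attention, the fixed example vectors, and the function names `modulate` and `abstract` are all assumptions chosen for illustration. In a real model the query tokens and feature vectors would be learned, not hand-written.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def modulate(question, modalities):
    """Hypothetical question-conditioned gate: weight each modality vector
    by its softmax-normalised similarity to the question embedding."""
    names = list(modalities)
    scale = math.sqrt(len(question))
    weights = softmax([dot(question, modalities[n]) / scale for n in names])
    fused = [
        sum(w * modalities[n][i] for w, n in zip(weights, names))
        for i in range(len(question))
    ]
    return dict(zip(names, weights)), fused

def abstract(queries, tokens):
    """Hypothetical abstractor: each query token cross-attends over the
    input tokens and returns one attention-weighted summary vector."""
    scale = math.sqrt(len(tokens[0]))
    summaries = []
    for q in queries:
        attn = softmax([dot(q, t) / scale for t in tokens])
        summaries.append(
            [sum(a * t[i] for a, t in zip(attn, tokens)) for i in range(len(q))]
        )
    return summaries

# Toy 4-dimensional embeddings (made-up values, not data from the paper).
question = [1.0, 0.0, 0.0, 0.0]
modalities = {
    "occupancy": [0.9, 0.1, 0.0, 0.0],
    "lidar": [0.2, 0.8, 0.0, 0.0],
    "text": [0.1, 0.1, 0.8, 0.0],
}
weights, fused = modulate(question, modalities)

# Two stand-in "abstract tokens" compress the three modality tokens into two summaries.
queries = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
summaries = abstract(queries, list(modalities.values()))
```

In this toy setup the question vector aligns most closely with the occupancy vector, so the gate assigns occupancy the largest weight, and the two query tokens reduce three modality tokens to two summary vectors regardless of how many input tokens there are.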

Computer Vision · Multimodal Models · Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 42

Year: 2025

Venue: arXiv.org
