Mar 3, 2026 · arXiv:2603.02609

VLMFusionOcc3D: VLM Assisted Multi-Modal 3D Semantic Occupancy Prediction

A. E. Doruk, A. Enes Doruk, Hasan F. Ates

AI Summary

VLMFusionOcc3D is a multimodal framework for 3D semantic occupancy prediction that uses Vision-Language Models (VLMs) to improve performance in autonomous driving, particularly under adverse weather. Instance-driven VLM Attention (InstVLM) injects semantic priors into 3D voxels, and Weather-Aware Adaptive Fusion (WeathFusion) dynamically re-weights sensor contributions according to environmental reliability. Experiments on the nuScenes and SemanticKITTI datasets show consistent improvements when the modules are added to state-of-the-art voxel-based baselines, especially in challenging weather.
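The re-weighting idea behind WeathFusion can be illustrated with a minimal sketch. It assumes a single scalar gate per voxel and a convex blend of camera and LiDAR features; the paper's actual gating network, inputs, and weights are not given in this summary. The names `CAMERA_BIAS`, `camera_gate`, and `fuse_voxel`, and all numeric values, are hypothetical.

```python
import math

# Hypothetical reliability biases (illustrative only, not from the paper):
# positive values favor the camera branch, negative values favor LiDAR.
CAMERA_BIAS = {"clear": 1.0, "rain": -0.5, "night": -1.5}


def sigmoid(x: float) -> float:
    """Logistic function mapping a real-valued logit to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def camera_gate(weather: str, learned_logit: float = 0.0) -> float:
    """Share of the fused feature taken from the camera branch.

    `learned_logit` stands in for whatever a trained gating network would
    output; here it is just a number supplied by the caller.
    """
    return sigmoid(learned_logit + CAMERA_BIAS[weather])


def fuse_voxel(cam_feat, lidar_feat, weather, learned_logit=0.0):
    """Convex blend of two equal-length feature vectors for one voxel."""
    g = camera_gate(weather, learned_logit)
    return [g * c + (1.0 - g) * l for c, l in zip(cam_feat, lidar_feat)]


if __name__ == "__main__":
    cam, lidar = [1.0, 0.0], [0.0, 1.0]
    for condition in ("clear", "rain", "night"):
        print(condition, fuse_voxel(cam, lidar, condition))
```

Because the gate lies strictly between 0 and 1, neither sensor is ever discarded outright; the weather condition only shifts how much each branch contributes.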

Key Contribution

By fusing LiDAR and camera data with VLM-derived semantic priors, VLMFusionOcc3D improves 3D semantic occupancy prediction over state-of-the-art voxel-based baselines, with notable gains in adverse weather, where current voxel-based models degrade.

Abstract

This paper introduces VLMFusionOcc3D, a robust multimodal framework for dense 3D semantic occupancy prediction in autonomous driving. Current voxel-based occupancy models often struggle with semantic ambiguity in sparse geometric grids and performance degradation under adverse weather conditions. To address these challenges, we leverage the rich linguistic priors of Vision-Language Models (VLMs) to anchor ambiguous voxel features to stable semantic concepts. Our framework initiates with a dual-branch feature extraction pipeline that projects multi-view images and LiDAR point clouds into a unified voxel space. We propose Instance-driven VLM Attention (InstVLM), which utilizes gated cross-attention and LoRA-adapted CLIP embeddings to inject high-level semantic and geographic priors directly into the 3D voxels. Furthermore, we introduce Weather-Aware Adaptive Fusion (WeathFusion), a dynamic gating mechanism that utilizes vehicle metadata and weather-conditioned prompts to re-weight sensor contributions based on real-time environmental reliability. To ensure structural consistency, a Depth-Aware Geometric Alignment (DAGA) loss is employed to align dense camera-derived geometry with sparse, spatially accurate LiDAR returns. Extensive experiments on the nuScenes and SemanticKITTI datasets demonstrate that our plug-and-play modules consistently enhance the performance of state-of-the-art voxel-based baselines. Notably, our approach achieves significant improvements in challenging weather scenarios, offering a scalable and robust solution for complex urban navigation.
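As a rough illustration of the gated cross-attention that InstVLM is described as using, the sketch below updates one voxel feature by attending over a small set of text embeddings. It makes several assumptions that the abstract does not confirm: a tanh-scaled residual gate, no learned query/key/value projections, and plain Python lists in place of tensors. The function name `gated_cross_attention` and its parameters are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def gated_cross_attention(voxel, text_embeds, gate_param):
    """Inject a text-derived prior into one voxel feature.

    voxel:       list[float] of length d, used as the query
    text_embeds: list of list[float], each of length d, used as keys and values
    gate_param:  scalar; tanh(gate_param) scales the injected prior
                 (assumed gating form, not taken from the paper)
    """
    d = len(voxel)
    # Scaled dot-product scores between the voxel and each text embedding.
    scores = [dot(voxel, t) / math.sqrt(d) for t in text_embeds]
    weights = softmax(scores)
    # Attention output: weighted average of the text embeddings.
    attended = [sum(w * t[i] for w, t in zip(weights, text_embeds))
                for i in range(d)]
    g = math.tanh(gate_param)
    # Residual update: the gate controls how much of the prior is added.
    return [v + g * a for v, a in zip(voxel, attended)]


if __name__ == "__main__":
    voxel = [1.0, 0.0]
    prompts = [[0.0, 2.0], [2.0, 0.0]]  # stand-ins for CLIP text embeddings
    print(gated_cross_attention(voxel, prompts, gate_param=0.5))
```

With `gate_param` set to 0 the voxel feature passes through unchanged, which is one common reason to use a gate of this kind: the semantic prior can be introduced gradually without disturbing a pretrained voxel backbone.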

Computer Vision, Multimodal Models, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 26

Year: 2026

Venue: N/A
