Baidu, BJTU, NTU, UMacau, UQ

Jun 15, 2026

arXiv:2606.16354

GraphBEV++: Multi-Modal Feature Alignment for Autonomous Driving

Ziying Song, Caiyan Jia, Lin Liu, Shaoqing Xu, Lei Yang, Yadan Luo

AI Summary

This paper introduces GraphBEV++, a multi-modal fusion framework that addresses feature misalignment in Bird's Eye View (BEV) perception for autonomous driving, particularly under calibration uncertainty between LiDAR and camera sensors. The framework has two modules. LocalAlign-v2 corrects local misalignment with neighborhood-aware depth features obtained via graph matching. GlobalAlign-v2 corrects global misalignment either by learning cross-modal feature offsets (Deformable variant) or by injecting noise and denoising to recover aligned features (Diffusion variant). The reported experiments show state-of-the-art performance under misalignment noise on nuScenes and a Waymo subset, improved long-range detection on Argoverse2, and better 3D occupancy prediction in both clean and noisy settings.

Key Contribution

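GraphBEV++ mitigates feature misalignment in end-to-end autonomous driving: compared with five baselines (UniAD, VAD, FusionAD, MomAD, and WoTE), it reports better performance across perception, prediction, and planning tasks in both open-loop (nuScenes) and closed-loop (Bench2Drive and NAVSIM) evaluations.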

Abstract

Feature misalignment in BEV perception is a critical yet often overlooked challenge in autonomous driving, especially under calibration uncertainties between LiDAR and camera sensors. To address this issue, we propose a robust multi-modal fusion framework, GraphBEV++, which systematically mitigates projection-induced misalignment. The framework consists of two key modules: LocalAlign-v2 and GlobalAlign-v2. LocalAlign-v2 introduces neighborhood-aware depth features via graph matching to correct local misalignment. It supports both LSS-based and query-based BEV representations, making it compatible with BEVFusion and BEVFormer architectures for consistent cross-paradigm alignment. GlobalAlign-v2 encompasses two variants: Deformable and Diffusion. The Deformable variant addresses global misalignment in LSS-based multi-modal BEV by explicitly learning cross-modal feature offsets. In contrast, the Diffusion variant targets implicit misalignment in query-based BEV by injecting noise to simulate misalignment and employing a denoising process to recover aligned features. Experimental results show that GraphBEV++ achieves state-of-the-art performance under misalignment noise on nuScenes and a Waymo subset, improves long-range detection on Argoverse2, and generalizes effectively to the 3D occupancy prediction task, consistently improving occupancy estimation accuracy and robustness under both clean and noisy settings. Furthermore, GraphBEV++ effectively alleviates misalignment issues in end-to-end autonomous driving. Compared with five baselines (UniAD, VAD, FusionAD, MomAD, and WoTE), it demonstrates superior performance in both open-loop (nuScenes) and closed-loop (Bench2Drive and NAVSIM) evaluations across perception, prediction, and planning tasks.
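To make "projection-induced misalignment" concrete, the following is a minimal sketch, not the authors' implementation. It assumes a simple pinhole camera, models calibration error as a small yaw rotation plus an optional translation, and uses a plain nearest-neighbor lookup in pixel space as a stand-in for gathering neighborhood depth. The function names `project` and `knn_depths`, the intrinsics, and the error values are all hypothetical and chosen only for illustration.

```python
import math


def project(point_xyz, fx, fy, cx, cy, yaw_error_rad=0.0, t_error=(0.0, 0.0, 0.0)):
    """Project a 3D point in the camera frame (x right, y down, z forward) to pixels.

    yaw_error_rad and t_error are hypothetical extrinsic calibration errors,
    applied before projection. Returns (u, v, depth).
    """
    x, y, z = point_xyz
    # Rotate about the camera's vertical (y) axis by the yaw error.
    c, s = math.cos(yaw_error_rad), math.sin(yaw_error_rad)
    xr = c * x + s * z + t_error[0]
    yr = y + t_error[1]
    zr = -s * x + c * z + t_error[2]
    if zr <= 0:
        raise ValueError("point is behind the camera")
    # Standard pinhole projection.
    return (fx * xr / zr + cx, fy * yr / zr + cy, zr)


def knn_depths(query_uv, projected, k):
    """Return the depths of the k projected points nearest to query_uv in pixel space.

    A stand-in for a neighborhood depth lookup: rather than trusting the single
    depth that lands on a pixel, collect the depths of nearby projections.
    """
    def sq_dist(p):
        return (p[0] - query_uv[0]) ** 2 + (p[1] - query_uv[1]) ** 2

    ranked = sorted(projected, key=sq_dist)
    return [p[2] for p in ranked[:k]]


# A point 10 m straight ahead, with assumed intrinsics.
fx = fy = 1000.0
cx, cy = 800.0, 450.0
clean = project((0.0, 0.0, 10.0), fx, fy, cx, cy)
noisy = project((0.0, 0.0, 10.0), fx, fy, cx, cy, yaw_error_rad=0.01)
pixel_shift = noisy[0] - clean[0]  # about 10 pixels for a 0.01 rad yaw error

# Depths of the two projections closest to a query pixel.
projected = [(800.0, 450.0, 10.0), (802.0, 451.0, 10.2), (900.0, 450.0, 30.0)]
neighbors = knn_depths((801.0, 450.0), projected, k=2)
```

With these assumed intrinsics, a yaw error of 0.01 rad (about 0.57 degrees) moves the projection by roughly 10 pixels, which is why a depth value read at a single pixel can belong to the wrong object and why looking at a neighborhood of depths is a natural remedy.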

Computer Vision, Multimodal Models
