Freiburg · Mercedes-Benz AG · Apr 6, 2026 · arXiv:2604.04797

Multi-Modal Sensor Fusion using Hybrid Attention for Autonomous Driving

Mayank Mayank, Bharanidhar Duraisamy, Florian Geiß, Abhinav Valada

AI Summary

MMF-BEV is a radar-camera fusion framework that uses deformable attention to align cross-modal features in bird's-eye-view (BEV) space for 3D object detection. The architecture pairs a BEVDepth camera branch with a RadarBEVNet radar branch, each enhanced with deformable self-attention, and fuses them via deformable cross-attention. Experiments on the View-of-Delft (VoD) dataset show that MMF-BEV outperforms unimodal baselines and is competitive with prior fusion methods. A sensor contribution analysis quantifies how each modality is weighted at different distances.
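As a rough illustration of the deformable cross-attention idea mentioned in the summary, the sketch below processes a single BEV query with a single head: the query feature predicts a few sampling offsets and attention logits, the other modality's BEV map is bilinearly sampled at the offset locations, and the samples are combined with softmax weights. This is a minimal sketch under stated assumptions, not the authors' implementation. The function names, the weight layouts `w_offset` and `w_score`, and the omission of value and output projections, multiple heads, and multiple feature scales are all simplifications introduced here.

```python
import math


def bilinear_sample(bev, x, y):
    """Sample bev[row][col] (a list of C floats per cell) at continuous (x, y).

    x indexes columns and y indexes rows; corners outside the grid add zero.
    """
    h, w, c = len(bev), len(bev[0]), len(bev[0][0])
    x0, y0 = math.floor(x), math.floor(y)
    out = [0.0] * c
    for yi, wy in ((y0, 1.0 - (y - y0)), (y0 + 1, y - y0)):
        for xi, wx in ((x0, 1.0 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yi < h and 0 <= xi < w:
                for k in range(c):
                    out[k] += wy * wx * bev[yi][xi][k]
    return out


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def _dot(a, b):
    return sum(p * q for p, q in zip(a, b))


def deformable_cross_attention(query, ref_xy, value_bev, w_offset, w_score):
    """Single-query, single-head sketch (hypothetical simplification).

    query:     C floats, e.g. the camera BEV feature of one cell.
    ref_xy:    (x, y) reference point of that cell in the value map's grid.
    value_bev: H x W x C nested lists, e.g. radar BEV features.
    w_offset:  K x 2 x C weights mapping the query to K offsets (dx, dy).
    w_score:   K x C weights mapping the query to K attention logits.
    """
    offsets = [(_dot(wk[0], query), _dot(wk[1], query)) for wk in w_offset]
    weights = softmax([_dot(wk, query) for wk in w_score])
    out = [0.0] * len(value_bev[0][0])
    for (dx, dy), a in zip(offsets, weights):
        sampled = bilinear_sample(value_bev, ref_xy[0] + dx, ref_xy[1] + dy)
        out = [o + a * s for o, s in zip(out, sampled)]
    return out


# Toy 2 x 2 single-channel "radar" map and one "camera" query (made-up values).
radar_bev = [[[0.0], [1.0]], [[2.0], [3.0]]]
fused = deformable_cross_attention(
    query=[1.0],
    ref_xy=(0.0, 0.0),
    value_bev=radar_bev,
    w_offset=[[[0.5], [0.0]], [[0.0], [0.5]]],  # offsets (0.5, 0) and (0, 0.5)
    w_score=[[0.0], [0.0]],                     # equal attention weights
)
print(fused)  # -> [0.75]
```

Because each query samples only a handful of learned locations rather than attending to every cell, the cost stays roughly linear in the number of BEV cells, which is the usual motivation for deformable attention over dense attention.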

Key Contribution

Radar and camera don't just see more together; they see *differently*. MMF-BEV exploits that complementarity, pairing the camera's dense semantics with the radar's precise range and velocity measurements to improve 3D object detection.

Abstract

Accurate 3D object detection for autonomous driving requires complementary sensors. Cameras provide dense semantics but unreliable depth, while millimeter-wave radar offers precise range and velocity measurements with sparse geometry. We propose MMF-BEV, a radar-camera BEV fusion framework that leverages deformable attention for cross-modal feature alignment on the View-of-Delft (VoD) 4D radar dataset [1]. MMF-BEV builds a BEVDepth [2] camera branch and a RadarBEVNet [3] radar branch, each enhanced with Deformable Self-Attention, and fuses them via a Deformable Cross-Attention module. We evaluate three configurations: camera-only, radar-only, and hybrid fusion. A sensor contribution analysis quantifies per-distance modality weighting, providing interpretable evidence of sensor complementarity. A two-stage training strategy, which first pre-trains the camera branch with depth supervision and then jointly trains the radar and fusion modules, stabilizes learning. Experiments on VoD show that MMF-BEV consistently outperforms unimodal baselines and achieves competitive results against prior fusion methods across all object classes in both the full annotated area and the near-range Region of Interest.
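The abstract does not say how the per-distance modality weighting is computed, so the following is only one plausible way to run such an analysis. It assumes that each BEV cell already carries a non-negative camera weight and radar weight (for example, derived from fusion attention or gating values, which is an assumption made here), normalizes them to shares, and averages the shares within distance bins measured from the ego vehicle. The function name `per_distance_contribution`, the input layout, and the toy numbers are hypothetical.

```python
import math
from collections import defaultdict


def per_distance_contribution(cells, bin_edges):
    """Average camera/radar share per distance bin (hypothetical analysis).

    cells:     iterable of (x_m, y_m, cam_w, radar_w), ego vehicle at origin.
    bin_edges: ascending distances in metres, e.g. [0, 10, 20].
    Returns {(lo, hi): (mean_camera_share, mean_radar_share)}.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for x, y, cam_w, radar_w in cells:
        total = cam_w + radar_w
        if total <= 0:
            continue  # skip cells with no usable weight
        dist = math.hypot(x, y)
        for lo, hi in zip(bin_edges, bin_edges[1:]):
            if lo <= dist < hi:
                acc = sums[(lo, hi)]
                acc[0] += cam_w / total
                acc[1] += radar_w / total
                acc[2] += 1
                break
    return {k: (c / n, r / n) for k, (c, r, n) in sums.items()}


# Made-up weights for three cells at 5 m, 10 m, and 15 m from the ego vehicle.
toy_cells = [
    (3.0, 4.0, 0.8, 0.2),
    (6.0, 8.0, 0.3, 0.7),
    (0.0, 15.0, 0.5, 0.5),
]
shares = per_distance_contribution(toy_cells, [0, 10, 20])
for (lo, hi), (cam, radar) in sorted(shares.items()):
    print(f"{lo}-{hi} m: camera {cam:.2f}, radar {radar:.2f}")
```

With these toy inputs the camera share is higher in the 0-10 m bin and the radar share is higher in the 10-20 m bin. That pattern comes from the made-up numbers and is not a result reported by the paper.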

Computer Vision · Multimodal Models · Robotics & Embodied AI

