This paper introduces DMPG-BEV, a novel camera-based BEV perception framework that leverages denoising diffusion implicit models (DDIMs) to generate point cloud features with enhanced 3D information. To reduce computational costs, the framework employs DDIMs with fewer denoising steps and incorporates a lightweight view transformation module. Experiments on the nuScenes dataset demonstrate that DMPG-BEV achieves improved performance with reduced computation compared to existing camera-based BEV perception methods.
Camera-based BEV perception gets a boost: DDIMs efficiently generate point cloud features, slashing computation while enhancing performance on nuScenes.
Bird’s eye view (BEV) perception requires swift and accurate representation of real-world 3-D information, and BEV-based 3-D object detection and semantic segmentation are foundational for downstream autonomous tasks such as prediction and planning. Current BEV perception methods are divided into light detection and ranging (LiDAR)-based and camera-based methods depending on the input data modality. LiDAR-based methods offer precise 3-D information but are limited by high sensor costs. Conversely, camera-based methods are economical and widely researched. However, transforming 2-D image features into BEV space introduces feature distortion due to inaccurate 3-D information, particularly in depth estimation, which results in noisy projections. Denoising diffusion probabilistic models (DDPMs) have been used to mitigate noise in BEV features; however, DDPMs increase computational costs and, because they operate on noisy images and BEV features, their effectiveness is limited. To address these issues, we propose a diffusion model-based point cloud feature generation framework for efficient camera-based BEV perception (DMPG-BEV) to enhance the performance of camera-based methods. This framework employs denoising diffusion implicit models (DDIMs) to generate point cloud features with 3-D information, while reducing computational demands by using fewer denoising steps. We also introduce a lightweight feature fusion module that integrates point cloud BEV features with image BEV features to incorporate 3-D information and reduce noise. The refined BEV features are also well-suited for multitask networks involving detection and segmentation tasks. Additionally, a lightweight view transformation module is included to further reduce computational costs. Extensive experiments on the nuScenes benchmark demonstrate improved performance of the proposed model with reduced computation, highlighting its value in BEV perception.
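The key efficiency lever the abstract describes, running DDIM with far fewer denoising steps than the full diffusion schedule, can be illustrated with a minimal NumPy sketch of the standard deterministic DDIM update. This is not the authors' implementation; the toy noise predictor, the linear beta schedule, and the feature-map shape are all placeholder assumptions.

```python
import numpy as np

def ddim_sample(eps_model, x_T, alpha_bar, n_steps):
    """Deterministic (eta = 0) DDIM sampling over a subsampled schedule.

    eps_model(x, t): predicted noise for sample x at timestep t.
    alpha_bar: cumulative product of (1 - beta_t), shape (T,).
    n_steps << T is what reduces the denoising cost.
    """
    T = len(alpha_bar)
    # Evenly subsample the full T-step schedule down to n_steps (descending).
    steps = np.linspace(T - 1, 0, n_steps).round().astype(int)
    x = x_T
    for i, t in enumerate(steps):
        a_t = alpha_bar[t]
        a_prev = alpha_bar[steps[i + 1]] if i + 1 < n_steps else 1.0
        eps = eps_model(x, t)
        # Predict the clean sample, then take a deterministic step toward it.
        x0 = (x - np.sqrt(1.0 - a_t) * eps) / np.sqrt(a_t)
        x = np.sqrt(a_prev) * x0 + np.sqrt(1.0 - a_prev) * eps
    return x

# Toy usage: linear beta schedule, dummy noise predictor standing in for the network.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)
rng = np.random.default_rng(0)
x_T = rng.standard_normal((4, 8))   # stand-in for a (tiny) BEV feature map
eps_model = lambda x, t: 0.1 * x    # placeholder noise predictor
features = ddim_sample(eps_model, x_T, alpha_bar, n_steps=20)  # 20 steps, not 1000
```

Because eta = 0 makes the sampler deterministic, the same initial noise always yields the same generated features, and the subsampled schedule trades a small amount of fidelity for a large reduction in network evaluations.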