This paper introduces a novel salient object segmentation method that integrates the Segment Anything Model (SAM) with depth information and cross-modal attention. The approach leverages SAM for feature extraction and a pre-trained depth estimation network to capture geometric information, dynamically fusing RGB and depth features using cross-modal attention. The method employs LoRA fine-tuning for computational efficiency and a UNet decoder to refine segmentation, achieving state-of-the-art results on five benchmark datasets, particularly in complex scenarios.
Ditching reliance on real depth sensors, this method achieves state-of-the-art salient object segmentation by cleverly synthesizing depth from RGB and fusing it with SAM features via cross-modal attention.
Accurate segmentation of salient objects is crucial for computer vision applications such as image editing, autonomous driving, and object detection. While the use of depth information (RGB-D) in saliency detection is attracting significant attention, its broad application is limited by dependence on depth sensors and the difficulty of effectively integrating RGB and depth cues. To address these issues, we propose a salient object segmentation method that integrates the Segment Anything Model (SAM), depth information, and a cross-modal attention mechanism. Our approach leverages SAM for robust feature extraction and combines it with a pre-trained depth estimation network to capture geometric information. By dynamically fusing RGB and depth features through cross-modal attention, the method handles diverse scenes more effectively. We also achieve computational efficiency without sacrificing precision by freezing the pre-trained weights and applying lightweight LoRA fine-tuning. A UNet decoder refines the segmentation output, preserving target boundary details in high-resolution outputs. Experiments on five challenging benchmark datasets validate the proposed method, showing significant improvements over existing methods on key evaluation metrics, including MaxF, MAE, and S-measure. The method is particularly strong and robust on complex backgrounds, small targets, and multiple salient objects. The significance of this work lies in advancing depth-guided RGB salient object segmentation while offering new insights into overcoming depth sensor dependency.
Furthermore, it opens up novel pathways for the effective fusion of cross-modal information, thereby contributing to the broader development and diversification of related technologies and their applications.
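The cross-modal fusion the abstract describes can be sketched as a toy single-head attention in which RGB tokens act as queries over depth tokens as keys and values. Everything here is an illustrative assumption rather than the paper's exact formulation: identity projections instead of learned ones, a simple residual add as the fusion step, and tiny feature dimensions.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_modal_attention(rgb_tokens, depth_tokens):
    """Toy single-head cross-modal attention (illustrative only).

    Each RGB token queries all depth tokens, gathers a weighted sum of
    depth features, and fuses it back via a residual add. Learned Q/K/V
    projections are replaced by the identity for clarity.
    """
    d = len(rgb_tokens[0])
    scale = 1.0 / math.sqrt(d)  # standard dot-product attention scaling
    fused = []
    for q in rgb_tokens:
        # Scaled dot-product scores of this RGB token against every depth token.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in depth_tokens]
        w = softmax(scores)
        # Depth context attended for this RGB token.
        attended = [sum(wj * v[i] for wj, v in zip(w, depth_tokens))
                    for i in range(d)]
        # Residual fusion: RGB token plus depth-attended context.
        fused.append([qi + ai for qi, ai in zip(q, attended)])
    return fused
```

With a zero RGB query, the scores tie and the depth tokens are simply averaged, so `cross_modal_attention([[0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]])` fuses in exactly `[1.0, 1.0]` of depth context.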
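The abstract's efficiency claim rests on LoRA: the pre-trained weight stays frozen and only a low-rank update is trained. A minimal sketch of a LoRA-augmented linear layer, with toy dimensions and plain-Python matrices as assumptions (the paper's actual layer shapes and scaling may differ):

```python
def lora_linear(x, W, A, B, alpha):
    """Forward pass of a linear layer with a LoRA update (toy sketch).

    y = x @ (W + (alpha / r) * A @ B)

    W (d_in x d_out) is the frozen pre-trained weight; only the low-rank
    factors A (d_in x r) and B (r x d_out) would receive gradients.
    """
    r = len(B)            # LoRA rank
    scale = alpha / r     # standard LoRA scaling factor
    d_out = len(W[0])
    # Low-rank path computed once: x @ A gives an r-dim bottleneck.
    xa = [sum(x[i] * A[i][t] for i in range(len(x))) for t in range(r)]
    y = []
    for j in range(d_out):
        base = sum(x[i] * W[i][j] for i in range(len(x)))   # frozen path
        delta = sum(xa[t] * B[t][j] for t in range(r))      # trainable path
        y.append(base + scale * delta)
    return y
```

Because B is conventionally zero-initialized, the layer starts out exactly equal to the frozen base layer: `lora_linear([1.0, 2.0], [[1.0, 0.0], [0.0, 1.0]], [[1.0], [1.0]], [[0.0, 0.0]], alpha=1.0)` returns `[1.0, 2.0]`, and training then moves the output only through A and B.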