Feb 12, 2026, arXiv:2602.11804

Efficient Segment Anything with Depth-Aware Fusion and Limited Training Data

Yiming Zhou, Xuenjie Xie, Panfeng Li, Albrecht Kunz, Ahmad Osman, Xavier Maldague

AI Summary

This paper introduces a lightweight RGB-D fusion framework to improve the efficiency and accuracy of Segment Anything Models (SAM). The authors augment EfficientViT-SAM with monocular depth priors generated by a pretrained estimator, and fuse the resulting depth features with RGB features at the mid level through a dedicated depth encoder. Trained on only 11.2k samples, the proposed method outperforms EfficientViT-SAM, which the authors take as evidence that depth cues act as effective geometric priors for segmentation.

Key Contribution

Depth-aware fusion lets a Segment Anything model be trained on only 11.2k samples (less than 0.1% of SA-1B) while achieving higher accuracy than EfficientViT-SAM.

Abstract

Segment Anything Models (SAM) achieve impressive universal segmentation performance but require massive datasets (e.g., 11M images) and rely solely on RGB inputs. Recent efficient variants reduce computation but still depend on large-scale training. We propose a lightweight RGB-D fusion framework that augments EfficientViT-SAM with monocular depth priors. Depth maps are generated with a pretrained estimator and fused mid-level with RGB features through a dedicated depth encoder. Trained on only 11.2k samples (less than 0.1% of SA-1B), our method achieves higher accuracy than EfficientViT-SAM, showing that depth cues provide strong geometric priors for segmentation.
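
The mid-level fusion described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the abstract does not state the fusion operator or the depth encoder design, so the sketch assumes element-wise additive fusion and a toy depth encoder (patch averaging followed by a per-channel projection). The names encode_depth, fuse_mid_level, patch, weight, and alpha are all hypothetical, and random arrays stand in for the real RGB features and the estimated depth map.

```python
import numpy as np


def encode_depth(depth, patch, weight):
    """Hypothetical depth encoder (an assumption, not the paper's design).

    Average-pools an (H, W) depth map into non-overlapping patches, then
    projects the single depth channel to C channels using `weight` (shape (C,)).
    Returns a feature map of shape (C, H // patch, W // patch).
    """
    h, w = depth.shape
    gh, gw = h // patch, w // patch
    # Crop to a multiple of the patch size and average inside each patch.
    cropped = depth[: gh * patch, : gw * patch]
    pooled = cropped.reshape(gh, patch, gw, patch).mean(axis=(1, 3))
    # Broadcasting (C, 1, 1) * (1, gh, gw) yields (C, gh, gw).
    return weight[:, None, None] * pooled[None, :, :]


def fuse_mid_level(rgb_feat, depth_feat, alpha=1.0):
    """Assumed fusion rule: add depth features, scaled by alpha, to RGB features."""
    if rgb_feat.shape != depth_feat.shape:
        raise ValueError("RGB and depth feature maps must have the same shape")
    return rgb_feat + alpha * depth_feat


rng = np.random.default_rng(0)
rgb_feat = rng.standard_normal((8, 4, 4))  # stand-in for mid-level RGB features (C, H', W')
depth = rng.random((64, 64))               # stand-in for a monocular depth estimate
depth_feat = encode_depth(depth, patch=16, weight=rng.standard_normal(8))
fused = fuse_mid_level(rgb_feat, depth_feat)
```

With alpha set to 0.0 the fused map equals the RGB features, so in a design like this the depth branch acts as an additive correction to an existing RGB backbone. A learned encoder and a different fusion operator (for example concatenation followed by a convolution) would fit the abstract's description equally well.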

Computer Vision, Multimodal Models, Training Efficiency & Optimization

Citation Metrics

Citations: 0

Influential citations: 0

References: 19

Year: 2026

Venue: N/A
