This paper investigates the impact of incorporating depth information during pre-training of Vision Foundation Models (VFMs) for surgical scene understanding, pre-training and evaluating eight ViT-based models that differ in pre-training domain, objective, and input modality (RGB vs. RGB-D). The study demonstrates that models with explicit geometric tokenization, such as MultiMAE, significantly outperform RGB-only baselines across surgical tasks including object detection, segmentation, depth estimation, and pose estimation. A key finding is the data efficiency of geometry-aware pre-training: models fine-tuned on only 25% of labeled data surpass RGB-only models trained on the full dataset.
Forget RGB-only pre-training for surgical robots: incorporating depth information during pre-training boosts performance and data efficiency without requiring architectural changes at inference time.
Vision foundation models (VFMs) have emerged as powerful tools for surgical scene understanding. However, current approaches predominantly rely on unimodal RGB pre-training, overlooking the complex 3D geometry inherent to surgical environments. Although several architectures support multimodal or geometry-aware inputs in general computer vision, the benefits of incorporating depth information in surgical settings remain underexplored. We conduct a large-scale empirical study comparing eight ViT-based VFMs that differ in pre-training domain, learning objective, and input modality (RGB vs. RGB-D). For pre-training, we use a curated dataset of 1.4 million robotic surgical images paired with depth maps generated from an off-the-shelf network. We evaluate these models under both frozen-backbone and end-to-end fine-tuning protocols across eight surgical datasets spanning object detection, segmentation, depth estimation, and pose estimation. Our experiments yield several consistent findings. Models incorporating explicit geometric tokenization, such as MultiMAE, substantially outperform unimodal baselines across all tasks. Notably, geometry-aware pre-training enables remarkable data efficiency: models fine-tuned on just 25% of labeled data consistently surpass RGB-only models trained on the full dataset. Importantly, these gains require no architectural or runtime changes at inference; depth is used only during pre-training, making adoption straightforward. These findings suggest that multimodal pre-training offers a viable path towards building more capable surgical vision systems.
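To make "explicit geometric tokenization" concrete, here is a minimal NumPy sketch of MultiMAE-style input preparation: each modality (RGB and depth) is split into patch tokens, projected through its own linear embedding into a shared dimension, and a random subset of the combined tokens is kept visible while the rest become reconstruction targets. All names (`patchify`, `tokenize_rgbd`), the embedding dimension, and the random projections stand in for learned layers; this is an illustration of the token layout, not the paper's actual implementation.

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) array into flattened patch tokens."""
    H, W, C = img.shape
    h, w = H // patch, W // patch
    x = img[: h * patch, : w * patch].reshape(h, patch, w, patch, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(h * w, patch * patch * C)

def tokenize_rgbd(rgb, depth, patch=16, dim=128, mask_ratio=0.75, seed=0):
    """MultiMAE-style RGB-D tokenization (sketch): per-modality patch
    embeddings into a shared token space, then random masking over the
    union of tokens. In a real model the projections are learned."""
    rng = np.random.default_rng(seed)
    rgb_tok = patchify(rgb, patch)                # (N, patch*patch*3)
    dep_tok = patchify(depth[..., None], patch)   # (N, patch*patch*1)
    # Per-modality linear projections to a shared embedding dim
    # (random here; learned nn.Linear layers in practice).
    W_rgb = rng.standard_normal((rgb_tok.shape[1], dim)) * 0.02
    W_dep = rng.standard_normal((dep_tok.shape[1], dim)) * 0.02
    tokens = np.concatenate([rgb_tok @ W_rgb, dep_tok @ W_dep], axis=0)
    # Keep a random (1 - mask_ratio) fraction visible; the encoder sees
    # only these, and the decoder reconstructs the masked remainder.
    n = tokens.shape[0]
    keep = np.sort(rng.permutation(n)[: int(n * (1 - mask_ratio))])
    return tokens[keep], keep

rgb = np.zeros((224, 224, 3), dtype=np.float32)
depth = np.zeros((224, 224), dtype=np.float32)
visible, idx = tokenize_rgbd(rgb, depth)
# 2 modalities * (224/16)^2 = 392 tokens total; 25% remain visible.
```

Because depth enters only as extra tokens during pre-training, the fine-tuned encoder can be run on RGB tokens alone at inference, which is why the abstract notes no architectural or runtime changes are needed.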