AI Laboratory · Shanghai AI Lab · SJTU · Jun 9, 2026 · arXiv:2606.10594

Segment and Select: Vision-Language Segmentation in 3D Scenarios

AI Summary

This paper introduces the SEGment-And-select (SEGA3D) paradigm for 3D vision-language segmentation, which operates directly on fine-grained visual information rather than on coarse superpoint representations. A mask candidate generator proposes fine-grained mask candidates, and a Large Language Model (LLM) supplies the semantic and spatial information used to select among them. The paper reports competitive results on the ScanRefer, ScanNet and Matterport3D benchmarks, surpassing the top-performing prior method by 8.3 mIoU on ScanNet and 5.3 mIoU on Matterport3D.

Key Contribution

SEGA3D surpasses the top-performing prior method by 8.3 mIoU on ScanNet and 5.3 mIoU on Matterport3D while removing the dependency on superpoint representations.
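For readers unfamiliar with the metric, the following is a minimal sketch of how mIoU is commonly computed for this kind of task. It assumes mIoU is the mean of per-sample intersection-over-union between predicted and ground-truth per-point masks, reported on a 0-100 scale; the paper's exact evaluation protocol is not given on this page, and the function names `iou` and `mean_iou` are illustrative.

```python
from typing import Sequence


def iou(pred: Sequence[bool], gt: Sequence[bool]) -> float:
    """Intersection-over-union of two boolean per-point masks of equal length."""
    if len(pred) != len(gt):
        raise ValueError("masks must cover the same set of points")
    inter = sum(1 for p, g in zip(pred, gt) if p and g)
    union = sum(1 for p, g in zip(pred, gt) if p or g)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def mean_iou(preds: Sequence[Sequence[bool]], gts: Sequence[Sequence[bool]]) -> float:
    """Mean of per-sample IoU, scaled to 0-100 (assumed reporting convention)."""
    if len(preds) != len(gts) or not preds:
        raise ValueError("need the same non-zero number of predictions and targets")
    scores = [iou(p, g) for p, g in zip(preds, gts)]
    return 100.0 * sum(scores) / len(scores)
```

Under this convention, a gain of 8.3 mIoU means the average per-sample overlap rose by 8.3 points on the 0-100 scale.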

Abstract

3D vision-language segmentation aims to segment target objects in 3D scenarios according to linguistic instructions and visual observations. Prior art relies heavily on coarse superpoint representations to reduce computational complexity, which leads to poor segmentation quality and messy object boundaries. In this paper, we propose the SEGment-And-select (SEGA3D) paradigm for 3D vision-language segmentation, which operates directly on fine-grained visual information and is free from the superpoint dependency. Specifically, we first leverage a mask candidate generator to provide fine-grained categorical mask candidates, substantially improving the quality of candidate masks over their superpoint counterparts. Then, a Large Language Model (LLM) is used to generate semantic and spatial information based on the linguistic description and visual features. The LLM output and visual features are fed to the Semantic-Spatial Selector (SSS) to produce the top-ranking mask candidates. Finally, the Loopback Verification Module (LVM) yields the segmentation mask from the selected candidate masks. Our SEGA3D attains competitive performance on the ScanRefer, ScanNet and Matterport3D benchmarks. Notably, SEGA3D surpasses the top-performing counterpart by 8.3 mIoU and 5.3 mIoU on ScanNet and Matterport3D, respectively. Code will be made available upon publication.
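The control flow described in the abstract (generate candidates, shortlist the top-ranking ones, then verify to obtain the final mask) can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: `Candidate`, `select_top_k`, and `segment_and_select` are hypothetical names, cosine similarity against a query embedding stands in for the Semantic-Spatial Selector, and the `verify` callable is a placeholder for the Loopback Verification Module, whose internals are not described on this page.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class Candidate:
    mask: List[bool]      # per-point membership of this candidate mask
    feature: List[float]  # hypothetical visual embedding of the masked region


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0.0 or nb == 0.0 else dot / (na * nb)


def select_top_k(candidates: Sequence[Candidate],
                 query: Sequence[float], k: int) -> List[Candidate]:
    """Stand-in selector: rank candidates by similarity to the query embedding."""
    ranked = sorted(candidates, key=lambda c: cosine(c.feature, query), reverse=True)
    return ranked[:k]


def segment_and_select(candidates: Sequence[Candidate],
                       query: Sequence[float],
                       verify: Callable[[Candidate], float],
                       k: int = 2) -> List[bool]:
    """Shortlist k candidates, then let a verification score pick the final mask."""
    shortlist = select_top_k(candidates, query, k)
    if not shortlist:
        raise ValueError("no mask candidates to select from")
    # Placeholder verification step: the highest-scoring shortlisted candidate wins.
    return max(shortlist, key=verify).mask
```

The point of the two-stage design in this sketch is that verification only ever sees the shortlist, so a candidate the selector ranks low cannot be chosen even if the verification score would favor it.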

Computer Vision Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
