This paper introduces a novel architecture for glass segmentation that combines general visual features from a frozen DINOv3 model with task-specific features from a supervised Swin model. The multi-scale feature representations are processed with residual Squeeze-and-Excitation Channel Reduction and fed into a Mask2Former Decoder for final segmentation. Evaluated on four datasets, the approach achieves state-of-the-art results on several accuracy metrics with competitive inference speed.
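Fuses frozen DINOv3 features with supervised Swin features for glass segmentation, reaching state-of-the-art accuracy on several metrics with competitive inference speed, and faster inference than the previous state-of-the-art method when a lighter DINOv3 backbone is used.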
Glass surface segmentation from RGB images is a challenging task because glass, as a transparent material, lacks distinctive visual characteristics. However, glass segmentation is critical for scene understanding and robotics, as transparent glass surfaces must be identified as solid material. This paper presents a novel architecture for glass segmentation that uses a dual backbone to produce both general visual features and task-specific learned visual features. The general visual features are produced by a frozen DINOv3 vision foundation model, and the task-specific features are generated by a Swin model trained in a supervised manner. The resulting multi-scale feature representations are downsampled with residual Squeeze-and-Excitation Channel Reduction and fed into a Mask2Former Decoder, which produces the final segmentation masks. The architecture was evaluated on four commonly used glass segmentation datasets, achieving state-of-the-art results on several accuracy metrics. The model also has competitive inference speed compared to the previous state-of-the-art method, and surpasses it when using a lighter DINOv3 backbone variant. The implementation source code and model weights are available at: https://github.com/ojalar/lgnet
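To make the fusion step more concrete, the following is a minimal NumPy sketch of one plausible reading of "residual Squeeze-and-Excitation Channel Reduction" applied at a single feature scale. The abstract does not specify the block's internals, so everything here is an assumption for illustration: the channel-wise concatenation of the two backbones' features, the function name `se_channel_reduction`, the weight names, the 1x1 projection used as the residual path, and all tensor sizes. The authors' actual implementation is in the linked repository and may differ.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_channel_reduction(feat, w_down, w_up, w_proj, w_skip):
    """Hypothetical residual SE channel reduction: (C_in, H, W) -> (C_out, H, W)."""
    # Squeeze: global average pool gives one descriptor per channel.
    z = feat.mean(axis=(1, 2))                         # (C_in,)
    # Excitation: bottleneck MLP with a sigmoid gate per channel.
    gate = sigmoid(w_up @ np.maximum(w_down @ z, 0.0)) # (C_in,)
    # Reweight the input channels by their gates.
    gated = feat * gate[:, None, None]
    # A 1x1 convolution is a per-pixel linear map over channels.
    reduced = np.einsum("oc,chw->ohw", w_proj, gated)  # (C_out, H, W)
    # Residual path (assumed): project the ungated input to C_out and add.
    skip = np.einsum("oc,chw->ohw", w_skip, feat)
    return reduced + skip

rng = np.random.default_rng(0)
# Stand-ins for one scale of backbone output; real features come from the models.
dino_feat = rng.standard_normal((8, 4, 4))   # frozen general-purpose features
swin_feat = rng.standard_normal((6, 4, 4))   # supervised task-specific features
fused = np.concatenate([dino_feat, swin_feat], axis=0)  # assumed fusion: (14, 4, 4)

c_in, hidden, c_out = fused.shape[0], 4, 5   # illustrative sizes only
w_down = rng.standard_normal((hidden, c_in)) * 0.1
w_up = rng.standard_normal((c_in, hidden)) * 0.1
w_proj = rng.standard_normal((c_out, c_in)) * 0.1
w_skip = rng.standard_normal((c_out, c_in)) * 0.1

out = se_channel_reduction(fused, w_down, w_up, w_proj, w_skip)
print(out.shape)  # (5, 4, 4)
```

In this sketch the gate decides how much each backbone's channels contribute before the channel count is reduced, while the residual path keeps an ungated route for the original features. In a full model, a block like this would run once per scale, and the reduced multi-scale maps would then go to the segmentation decoder.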