This paper presents EPS3D, an end-to-end feed-forward framework for open-vocabulary 3D panoptic segmentation. It removes the need for additional preprocessing by using a distillation-based training strategy on diverse 3D scenes to predict 3D-aware semantic and instance features from multi-view images, which improves 3D consistency and avoids error accumulation. A mutual enhancement module then enforces consistency between the semantic and instance features. EPS3D outperforms state-of-the-art baselines on two benchmarks, for example by +13% mean Intersection over Union (mIoU) for semantics on the Replica benchmark, while running at 1 second per scene.
EPS3D improves semantic mIoU on Replica by 13% over state-of-the-art baselines while processing a 3D scene in one second.
This paper introduces EPS3D, a new end-to-end feed-forward framework for open-vocabulary 3D panoptic segmentation. Unlike existing methods that rely on additional preprocessing, we design an end-to-end architecture with a distillation-based training strategy on diverse 3D scenes. The model predicts 3D-aware semantic and instance features from multi-view images, improving 3D consistency and avoiding error accumulation. We further propose a mutual enhancement module to enforce inherent semantic-instance consistency. By aligning semantics within instances (Ins2Sem) and refining instance features with semantic guidance (Sem2Ins), we achieve more coherent 3D scene understanding. EPS3D outperforms state-of-the-art (SOTA) baselines on two benchmarks (e.g., +13% mIoU for semantics on Replica) with high efficiency (e.g., 1s per scene), supporting tasks like robotic manipulation and 3D scene editing.
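
The Ins2Sem idea of aligning semantics within instances can be illustrated with a simple pooling step. This is only a sketch under stated assumptions: the abstract does not describe how the module is implemented, so the function name `ins2sem_pool`, the per-point feature layout, and the choice of mean pooling are all hypothetical and chosen purely for illustration. It is not EPS3D's actual method.

```python
from collections import defaultdict


def ins2sem_pool(sem_feats, instance_ids):
    """Give every point the mean semantic feature of its instance.

    ASSUMPTION: mean pooling is a stand-in for "aligning semantics
    within instances"; the real Ins2Sem module may work differently.

    sem_feats:    list of equal-length float lists, one per 3D point.
    instance_ids: list of ints, one instance ID per point.
    """
    sums = {}
    counts = defaultdict(int)
    # Accumulate per-instance feature sums and point counts.
    for feat, inst in zip(sem_feats, instance_ids):
        if inst not in sums:
            sums[inst] = [0.0] * len(feat)
        for d, value in enumerate(feat):
            sums[inst][d] += value
        counts[inst] += 1
    # Turn the sums into per-instance means.
    means = {
        inst: [s / counts[inst] for s in total]
        for inst, total in sums.items()
    }
    # Broadcast each instance mean back to its points.
    return [list(means[inst]) for inst in instance_ids]


# Toy input: four points, two instances, 2-D semantic features.
feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [0.0, 2.0]]
ids = [0, 0, 1, 1]
pooled = ins2sem_pool(feats, ids)
```

After pooling, all points that share an instance ID carry the same semantic feature, so a downstream open-vocabulary classifier would assign one label per instance. A Sem2Ins step would run in the opposite direction, using semantic information to refine the instance features; it is not sketched here because the abstract gives no further detail.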