This paper introduces Detailed 3D Referring Expression Segmentation (3D-DRES), a novel task that maps noun phrases to 3D instances to facilitate fine-grained 3D vision-language understanding. To support this task, the authors created DetailRefer, a dataset with 54,432 descriptions and phrase-instance annotations for 11,054 objects. They also propose DetailBase, a baseline architecture for dual-mode segmentation, demonstrating improved performance on both phrase-level segmentation and traditional 3D-RES benchmarks.
Unlock compositional reasoning in 3D vision-language models with a new dataset that maps noun phrases to 3D instances, revealing improvements on both fine-grained and traditional segmentation tasks.
Current 3D visual grounding tasks operate only at the sentence level, whether for detection or segmentation, and therefore fail to exploit the rich compositional context within natural language expressions. To address this limitation, we introduce Detailed 3D Referring Expression Segmentation (3D-DRES), a new task that provides a phrase-to-3D-instance mapping, aiming to enhance fine-grained 3D vision-language understanding. To support 3D-DRES, we present DetailRefer, a new dataset comprising 54,432 descriptions spanning 11,054 distinct objects. Unlike previous datasets, DetailRefer implements a pioneering phrase-instance annotation paradigm in which each referenced noun phrase is explicitly mapped to its corresponding 3D elements. Additionally, we introduce DetailBase, a purposefully streamlined yet effective baseline architecture that supports dual-mode segmentation at both the sentence and phrase levels. Our experimental results demonstrate that models trained on DetailRefer not only excel at phrase-level segmentation but also show surprising improvements on traditional 3D-RES benchmarks.
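To make the phrase-instance annotation idea concrete, here is a minimal sketch of what one annotated sample might look like and how sentence-level and phrase-level masks could be derived from it. This is not DetailRefer's actual file format or the DetailBase model: the class names, field names, character-offset spans, and per-point instance labels are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class PhraseAnnotation:
    # Hypothetical field: character offsets [start, end) of the noun phrase
    # inside the description.
    span: Tuple[int, int]
    # Hypothetical field: ids of the 3D instances this phrase refers to.
    instance_ids: List[int]


@dataclass
class DetailedReferringSample:
    description: str
    target_instance_id: int          # sentence-level target (classic 3D-RES)
    phrases: List[PhraseAnnotation]  # phrase-level targets (3D-DRES)


def sentence_mask(sample: DetailedReferringSample,
                  point_instance_ids: List[int]) -> List[bool]:
    """Per-point mask for the single object the whole sentence refers to."""
    return [pid == sample.target_instance_id for pid in point_instance_ids]


def phrase_masks(sample: DetailedReferringSample,
                 point_instance_ids: List[int]) -> Dict[str, List[bool]]:
    """Per-point mask for every annotated noun phrase, keyed by phrase text."""
    masks: Dict[str, List[bool]] = {}
    for phrase in sample.phrases:
        start, end = phrase.span
        text = sample.description[start:end]
        wanted = set(phrase.instance_ids)
        masks[text] = [pid in wanted for pid in point_instance_ids]
    return masks


# Toy scene: five points, each labelled with an (assumed) instance id.
# Instance 3 is a chair, instance 7 is a table, instance 0 is background.
points = [3, 3, 7, 7, 0]
sample = DetailedReferringSample(
    description="the chair next to the table",
    target_instance_id=3,
    phrases=[
        PhraseAnnotation(span=(0, 9), instance_ids=[3]),    # "the chair"
        PhraseAnnotation(span=(18, 27), instance_ids=[7]),  # "the table"
    ],
)
sentence_level = sentence_mask(sample, points)
phrase_level = phrase_masks(sample, points)
```

In this toy setup the sentence-level mask covers only the chair, while the phrase-level output additionally yields a mask for "the table", which is the extra supervision a phrase-to-instance mapping would supply under these assumptions.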