The paper introduces Conversational Image Segmentation (CIS) to address the limitations of existing referring image grounding methods, which primarily focus on categorical and spatial queries, by incorporating functional, physical, and intent-driven reasoning. To facilitate research in this area, the authors create ConverSeg, a new benchmark dataset spanning a diverse range of reasoning types, and ConverSeg-Net, a model that combines segmentation priors with language understanding. The authors demonstrate that ConverSeg-Net, when trained on prompt-mask pairs generated by their AI-powered data engine, significantly outperforms existing models on the new ConverSeg benchmark while maintaining strong performance on existing benchmarks.
Current language-guided segmentation models fall short when reasoning about function, safety, and intent, but a new model and training data engine close the gap.
Conversational image segmentation grounds abstract, intent-driven concepts into pixel-accurate masks. Prior work on referring image grounding focuses on categorical and spatial queries (e.g., "left-most apple") and overlooks functional and physical reasoning (e.g., "where can I safely store the knife?"). We address this gap and introduce Conversational Image Segmentation (CIS) and ConverSeg, a benchmark spanning entities, spatial relations, intent, affordances, functions, safety, and physical reasoning. We also present ConverSeg-Net, which fuses strong segmentation priors with language understanding, and an AI-powered data engine that generates prompt-mask pairs without human supervision. We show that current language-guided segmentation models are inadequate for CIS, while ConverSeg-Net trained with our data engine achieves significant gains on ConverSeg and maintains strong performance on existing language-guided segmentation benchmarks. Project webpage: https://glab-caltech.github.io/converseg/