The paper introduces Conversational Image Segmentation (CIS) to address the limitations of existing referring image grounding methods, which primarily focus on categorical and spatial queries, by incorporating functional, physical, and intent-driven reasoning. To facilitate research in this area, the authors create ConverSeg, a new benchmark dataset spanning a diverse range of reasoning types, and ConverSeg-Net, a model that combines segmentation priors with language understanding. The authors demonstrate that ConverSeg-Net, when trained on prompt-mask pairs generated by their AI-powered data engine, significantly outperforms existing models on the new ConverSeg benchmark while maintaining strong performance on existing benchmarks.
Current language-guided segmentation models fall short when reasoning about function, safety, and intent, but a new model and training data engine close the gap.
Conversational image segmentation grounds abstract, intent-driven concepts into pixel-accurate masks. Prior work on referring image grounding focuses on categorical and spatial queries (e.g., "left-most apple") and overlooks functional and physical reasoning (e.g., "where can I safely store the knife?"). We address this gap and introduce Conversational Image Segmentation (CIS) and ConverSeg, a benchmark spanning entities, spatial relations, intent, affordances, functions, safety, and physical reasoning. We also present ConverSeg-Net, which fuses strong segmentation priors with language understanding, and an AI-powered data engine that generates prompt-mask pairs without human supervision. We show that current language-guided segmentation models are inadequate for CIS, while ConverSeg-Net trained with our data engine achieves significant gains on ConverSeg and maintains strong performance on existing language-guided segmentation benchmarks. Project webpage: https://glab-caltech.github.io/converseg/