The paper introduces KeySeg, a novel architecture for keyword-conditioned image segmentation that addresses the semantic discrepancy between language interpretation and visual segmentation by explicitly encoding inferred query conditions. KeySeg uses a dedicated [KEY] token, fused with a [SEG] token via cross-attention, to integrate core concepts extracted from multimodal inputs into the segmentation process. The introduction of a keyword alignment loss guides the [KEY] token to align with the semantic core of the input query, improving condition interpretation accuracy and leading to more precise segmentation.
KeySeg tackles semantic discrepancies in multimodal image segmentation by explicitly encoding query conditions into a dedicated token, leading to more precise and interpretable segmentation.
Advancements in multimodal large language models have opened up new possibilities for reasoning-based image segmentation by jointly processing visual and linguistic information. However, existing approaches often suffer from a semantic discrepancy between language interpretation and visual segmentation, owing to the lack of a structural connection between query understanding and segmentation execution. To address this issue, we propose KeySeg, a keyword-conditioned image segmentation model with a novel architecture that explicitly encodes and integrates inferred query conditions into the segmentation process. KeySeg embeds the core concepts extracted from multimodal inputs into a dedicated [KEY] token, which is then fused with a [SEG] token through a cross-attention-based fusion module. This design enables the model to reflect query conditions explicitly and precisely in the segmentation criteria. Additionally, we introduce a keyword alignment loss that guides the [KEY] token to align closely with the semantic core of the input query, thereby enhancing the accuracy of condition interpretation. By separating the roles of condition reasoning and segmentation instruction, and by making their interactions explicit, KeySeg achieves both expressive capacity and interpretative stability, even under complex language conditions.
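The abstract describes two mechanisms: a cross-attention fusion step that injects [KEY]-token information into the [SEG] token, and a keyword alignment loss that pulls the [KEY] token toward the query's semantic core. The following is a minimal numpy sketch of what such components could look like; it is not the authors' implementation, and all function names, the residual-fusion choice, and the cosine-distance form of the alignment loss are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(seg_token, key_tokens):
    """Fuse a [SEG] query with [KEY] tokens via scaled dot-product
    cross-attention, followed by a residual connection (an assumption;
    the paper only states the fusion is cross-attention-based).

    seg_token:  (1, d)      -- segmentation instruction token
    key_tokens: (n_keys, d) -- encoded keyword/condition tokens
    """
    d = seg_token.shape[-1]
    scores = seg_token @ key_tokens.T / np.sqrt(d)  # (1, n_keys)
    weights = softmax(scores, axis=-1)              # attention over keywords
    context = weights @ key_tokens                  # (1, d) condition summary
    return seg_token + context                      # residual fusion

def keyword_alignment_loss(key_token, target_embedding, eps=1e-8):
    """Cosine-distance loss guiding the [KEY] token toward a reference
    embedding of the query's semantic core (the exact loss form in the
    paper is not specified here; cosine distance is one plausible choice)."""
    cos = float(key_token @ target_embedding) / (
        np.linalg.norm(key_token) * np.linalg.norm(target_embedding) + eps)
    return 1.0 - cos

# Tiny usage example with random embeddings.
rng = np.random.default_rng(0)
d, n_keys = 8, 3
seg = rng.standard_normal((1, d))
keys = rng.standard_normal((n_keys, d))
fused = cross_attention_fuse(seg, keys)   # conditioned [SEG] token, shape (1, d)
loss = keyword_alignment_loss(keys[0], keys[0])  # identical vectors -> loss ~ 0
```

In this sketch, the attention weights make the conditioning interpretable: inspecting `weights` shows which keyword tokens dominate the fused segmentation instruction, which matches the paper's stated goal of interpretative stability.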