The paper introduces KeySeg, a novel architecture for keyword-conditioned image segmentation that addresses the semantic discrepancy between language interpretation and visual segmentation by explicitly encoding inferred query conditions. KeySeg uses a dedicated [KEY] token, fused with a [SEG] token via cross-attention, to integrate core concepts extracted from multimodal inputs into the segmentation process. The introduction of a keyword alignment loss guides the [KEY] token to align with the semantic core of the input query, improving condition interpretation accuracy and leading to more precise segmentation.
KeySeg tackles semantic discrepancies in multimodal image segmentation by explicitly encoding query conditions into a dedicated token, leading to more precise and interpretable segmentation.
Advancements in multimodal large language models have opened up new possibilities for reasoning-based image segmentation by jointly processing visual and linguistic information. However, existing approaches often suffer from a semantic discrepancy between language interpretation and visual segmentation, owing to the lack of a structural connection between query understanding and segmentation execution. To address this issue, we propose KeySeg, a keyword-conditioned image segmentation model with a novel architecture that explicitly encodes and integrates inferred query conditions into the segmentation process. KeySeg embeds the core concepts extracted from multimodal inputs into a dedicated [KEY] token, which is then fused with a [SEG] token through a cross-attention-based fusion module. This design enables the model to reflect query conditions explicitly and precisely in the segmentation criteria. Additionally, we introduce a keyword alignment loss that guides the [KEY] token to align closely with the semantic core of the input query, thereby enhancing the accuracy of condition interpretation. By separating the roles of condition reasoning and segmentation instruction, and by making their interactions explicit, KeySeg achieves both expressive capacity and interpretative stability, even under complex language conditions.
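The abstract describes two mechanisms: a cross-attention fusion step that injects [KEY]-token information into the [SEG] token, and a keyword alignment loss that pulls the [KEY] token toward the query's semantic core. The following is a minimal numpy sketch of what such components could look like; it is not the authors' implementation, and all function names, the residual-fusion choice, and the cosine-distance form of the alignment loss are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(seg_token, key_tokens):
    """Fuse a [SEG] query with [KEY] tokens via scaled dot-product
    cross-attention, followed by a residual connection (an assumption;
    the paper only states the fusion is cross-attention-based).

    seg_token:  (1, d)      -- segmentation instruction token
    key_tokens: (n_keys, d) -- encoded keyword/condition tokens
    """
    d = seg_token.shape[-1]
    scores = seg_token @ key_tokens.T / np.sqrt(d)  # (1, n_keys)
    weights = softmax(scores, axis=-1)              # attention over keywords
    context = weights @ key_tokens                  # (1, d) condition summary
    return seg_token + context                      # residual fusion

def keyword_alignment_loss(key_token, target_embedding, eps=1e-8):
    """Cosine-distance loss guiding the [KEY] token toward a reference
    embedding of the query's semantic core (the exact loss form in the
    paper is not specified here; cosine distance is one plausible choice)."""
    cos = float(key_token @ target_embedding) / (
        np.linalg.norm(key_token) * np.linalg.norm(target_embedding) + eps)
    return 1.0 - cos

# Tiny usage example with random embeddings.
rng = np.random.default_rng(0)
d, n_keys = 8, 3
seg = rng.standard_normal((1, d))
keys = rng.standard_normal((n_keys, d))
fused = cross_attention_fuse(seg, keys)   # conditioned [SEG] token, shape (1, d)
loss = keyword_alignment_loss(keys[0], keys[0])  # identical vectors -> loss ~ 0
```

In this sketch, the attention weights make the conditioning interpretable: inspecting `weights` shows which keyword tokens dominate the fused segmentation instruction, which matches the paper's stated goal of interpretative stability.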