This paper introduces Alignment-Aware Masked Learning (AML) for Referring Image Segmentation (RIS), which improves performance by explicitly estimating pixel-level vision-language alignment. AML filters out poorly aligned regions during training, focusing optimization on areas with strong vision-language correspondence. Experiments demonstrate state-of-the-art results on RefCOCO datasets and improved robustness to diverse descriptions.
Stop letting noisy vision-language alignment ruin your referring image segmentation: AML filters out the bad parts.
Referring Image Segmentation (RIS) aims to segment the object in an image that a natural language expression refers to. The paper introduces Alignment-Aware Masked Learning (AML), a training strategy that enhances RIS by explicitly estimating pixel-level vision-language alignment, filtering out poorly aligned regions during optimization, and focusing on trustworthy cues. This approach yields state-of-the-art performance on RefCOCO datasets and improves robustness to diverse descriptions and scenarios.
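To make the idea of alignment-based filtering concrete, here is a minimal NumPy sketch of one way such a masked loss could look. It is not the paper's implementation: the use of cosine similarity as the alignment score, the fixed `threshold`, the binary cross-entropy loss, and all function names (`alignment_map`, `masked_bce_loss`, `aml_style_loss`) are assumptions made for illustration only.

```python
import numpy as np

def alignment_map(pixel_feats, text_emb, eps=1e-8):
    """Cosine similarity between each pixel feature and one text embedding.

    Assumption: pixel_feats has shape (H, W, D), text_emb has shape (D,).
    The paper may estimate alignment differently.
    """
    pix = pixel_feats / (np.linalg.norm(pixel_feats, axis=-1, keepdims=True) + eps)
    txt = text_emb / (np.linalg.norm(text_emb) + eps)
    return pix @ txt  # shape (H, W), values in [-1, 1]

def masked_bce_loss(logits, target, keep_mask, eps=1e-8):
    """Binary cross-entropy averaged only over pixels where keep_mask is True."""
    probs = 1.0 / (1.0 + np.exp(-logits))
    bce = -(target * np.log(probs + eps) + (1 - target) * np.log(1 - probs + eps))
    kept = keep_mask.astype(bce.dtype)
    # Guard against an empty mask so the division is always defined.
    return float((bce * kept).sum() / max(kept.sum(), 1.0))

def aml_style_loss(pixel_feats, text_emb, logits, target, threshold=0.0):
    """Hypothetical AML-style step: drop poorly aligned pixels, then compute the loss."""
    align = alignment_map(pixel_feats, text_emb)
    keep_mask = align >= threshold  # assumed hard threshold on the alignment score
    return masked_bce_loss(logits, target, keep_mask), keep_mask

# Toy 2x2 image with 2-dimensional features (illustrative values only).
feats = np.array([[[1.0, 0.0], [0.0, 1.0]],
                  [[-1.0, 0.0], [1.0, 1.0]]])
text = np.array([1.0, 0.0])
logits = np.zeros((2, 2))
target = np.array([[1.0, 0.0], [0.0, 1.0]])
loss, keep = aml_style_loss(feats, text, logits, target, threshold=0.5)
```

In this toy case only the pixels whose features point roughly in the direction of the text embedding contribute to the loss; the other pixels are ignored during optimization, which is the filtering behavior the summary describes.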