The paper identifies prompt ambiguity as a key bottleneck in referring image segmentation pipelines that combine vision-language models (VLMs) with promptable segmenters such as the Segment Anything Model (SAM). The authors show that naively sampled interior points added to a bounding-box prompt often degrade performance because they land near object boundaries or on background clutter. To address this, they introduce PinPoint, a training-free point selector that fuses visual cues to identify stable, informative interior points; at a matched budget of five points, such selection improves cumulative Intersection-over-Union (cIoU) by 12-18 points over naive sampling across RefCOCO/+/g.
Forget training: PinPoint, a training-free point selector, closes the performance gap between zero-shot VLMs and fine-tuned specialists in referring image segmentation by intelligently choosing interior points for prompting SAM.
Modern referring image segmentation pipelines couple a vision-language model (VLM) for grounding with a promptable segmenter such as the Segment Anything Model (SAM) for mask generation. Prior training-free instances of this recipe consistently trail fine-tuned and reinforcement-learning (RL)-tuned specialists, and it has been unclear whether the gap comes from the VLM's grounding, SAM's capacity, or the prompt. We show that the gap is dominated by prompt ambiguity: a VLM-proposed bounding box (bbox) leaves SAM to guess which pixels inside the bbox belong to the object the expression denotes. Interior points are the natural disambiguator, but where they fall matters; prior work relies on naively sampled points that land on boundaries, distractors, and background clutter, and can even hurt performance compared to the bbox alone. Supervised and RL-tuned methods close this gap by training a VLM to predict better points; we show that this training is unnecessary. At a matched budget of five interior points, replacing naive sampling with stable, informative point selection improves cumulative Intersection-over-Union (cIoU) by 12-18 points across RefCOCO/+/g, with every model fixed. We turn this observation into PinPoint, a deterministic, training-free point selector that fuses four visual cues into a consensus map, selects compact, spatially diverse points away from boundaries, and uses the frozen VLM to label each point. Without any task-specific training, PinPoint matches supervised and RL-tuned specialists on the same stack while issuing only two VLM calls per query.
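To make the selection step concrete, here is a minimal sketch of one way a consensus-based point selector could work. It is not the authors' implementation. The abstract does not name the four cues or the fusion rule, so this sketch makes several assumptions: the cues are per-pixel score maps in [0, 1] over the bbox crop, fusion is a plain average, "away from boundaries" is approximated by a breadth-first distance to the nearest low-consensus pixel, and spatial diversity is enforced by a minimum Manhattan separation. The names `fuse_cues`, `boundary_distance`, `select_points`, `threshold`, and `min_sep` are all hypothetical. The VLM labeling step is omitted.

```python
from collections import deque


def fuse_cues(cues):
    """Average per-pixel cue maps (2D lists with values in [0, 1]) into one map.

    Assumption: plain averaging stands in for whatever fusion the paper uses.
    """
    h, w = len(cues[0]), len(cues[0][0])
    return [[sum(c[y][x] for c in cues) / len(cues) for x in range(w)]
            for y in range(h)]


def boundary_distance(mask):
    """4-connected BFS distance from each foreground pixel to the nearest
    background pixel. Pixels outside the grid count as background."""
    h, w = len(mask), len(mask[0])
    dist = [[0] * w for _ in range(h)]
    queue = deque()
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            neighbors = ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
            # Foreground pixels touching background (or the edge) get distance 1.
            if any(not (0 <= ny < h and 0 <= nx < w) or not mask[ny][nx]
                   for ny, nx in neighbors):
                dist[y][x] = 1
                queue.append((y, x))
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and dist[ny][nx] == 0:
                dist[ny][nx] = dist[y][x] + 1
                queue.append((ny, nx))
    return dist


def select_points(cues, k=5, threshold=0.5, min_sep=2):
    """Deterministically pick up to k interior points as (row, col) tuples.

    Candidates are ranked by consensus * boundary distance, so confident
    pixels deep inside the region come first; a candidate is skipped if it
    lies within min_sep (Manhattan distance) of an already chosen point.
    """
    consensus = fuse_cues(cues)
    mask = [[v >= threshold for v in row] for row in consensus]
    dist = boundary_distance(mask)
    candidates = sorted(
        ((consensus[y][x] * dist[y][x], y, x)
         for y in range(len(mask)) for x in range(len(mask[0])) if mask[y][x]),
        key=lambda t: (-t[0], t[1], t[2]),  # score descending, then row-major
    )
    chosen = []
    for _, y, x in candidates:
        if all(abs(y - cy) + abs(x - cx) >= min_sep for cy, cx in chosen):
            chosen.append((y, x))
        if len(chosen) == k:
            break
    return chosen
```

The returned points could then be passed to a promptable segmenter together with the bbox. Ranking by the product of consensus and boundary distance is only one plausible way to prefer points that are both confident and far from edges; the paper may weight or combine these factors differently.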