Soongsil University | May 21, 2026 | arXiv:2605.22651

What Does the Caption Really Say? Counterfactual Phrase Intervention for Compositional Data Selection in Vision-Language Pretraining

AI Summary

This paper introduces Counterfactual Phrase Intervention (CPI), a novel phrase-level data curation framework for vision-language pretraining that identifies and ranks image-text pairs based on the sensitivity of the image-text score to controlled nonce-token substitutions within caption phrases. CPI addresses the saturation of global alignment signals in web-scale data by focusing on compositional supervision, i.e., whether individual object, attribute, and relation phrases materially support the image-text match. Experiments on CC3M demonstrate that CPI-ranked subsets outperform full-data baselines and alignment-only filtering, improving compositional generalization on VL-CheckList-VG Relation by +1.91 and showing further gains when applied to NegCLIP.

Key Contribution

Stop relying solely on global image-text alignment scores to curate vision-language pretraining data: ranking by a phrase-level sensitivity signal yields a 50% data subset that improves compositional generalization, including +1.91 on VL-CheckList-VG Relation over the full-data baseline.

Abstract

CLIP-style contrastive pretraining typically curates web-scale image-text pairs using sample-level filtering signals, often based on pair-level alignment. We show that this signal saturates: once coarse mismatches are removed, stricter global filtering no longer tracks the compositional supervision provided by the retained captions. The reason is structural: a global score conflates whether a pair is broadly plausible with whether the individual object, attribute, and relation phrases inside the caption materially support the image-text match. The latter is what compositional generalization demands, yet pair-level filters are blind to it. We address this with Counterfactual Phrase Intervention (CPI), a phrase-level curation framework that converts controlled nonce-token substitutions into image-conditioned phrase-sensitivity scores. CPI uses global alignment only for coarse mismatch removal, then ranks the surviving pool by whether caption phrases measurably affect the image-text score under controlled substitution. We frame CPI as a first-order phrase-sensitivity signal rather than a grounding or identification result, and evaluate it at CC3M scale. Ranking by this signal yields a 50%-data subset that improves VL-CheckList-VG Relation by +1.91 over the full-data baseline and by +1.00 over alignment-only filtering at a matched budget, while improving SugarCrepe overall and preserving general transfer. CPI is loss-orthogonal: applied unchanged to NegCLIP, it further improves VL-CheckList-VG Relation by +3.84, with additional CE-CLIP gains reported in the main text.
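
The two-stage procedure described in the abstract (coarse alignment filtering, then ranking by phrase sensitivity under nonce substitution) can be illustrated with a minimal sketch. This is not the authors' implementation. The nonce string, the mean aggregation over phrases, the alignment threshold, the choice to apply the keep fraction to the surviving pool, and the toy word-overlap scorer standing in for a CLIP-style image-text score are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Set, Tuple

# Hypothetical nonce token; the paper's actual substitution scheme is not given here.
NONCE = "blicket"

# Toy stand-in: an "image" is a set of concept words. A real pipeline would use
# an image-text similarity model instead (assumption).
Image = Set[str]
Scorer = Callable[[Image, str], float]
Pair = Tuple[Image, str, Sequence[str]]  # (image, caption, caption phrases)


def toy_score(image: Image, caption: str) -> float:
    """Fraction of caption words that appear in the image's concept set."""
    words = caption.split()
    return sum(w in image for w in words) / len(words) if words else 0.0


def phrase_sensitivities(image: Image, caption: str, phrases: Sequence[str],
                         score: Scorer, nonce: str = NONCE) -> List[float]:
    """Score drop caused by replacing each phrase with the nonce token."""
    base = score(image, caption)
    drops = []
    for phrase in phrases:
        if phrase not in caption:
            continue  # skip phrases that do not occur verbatim
        edited = caption.replace(phrase, nonce, 1)
        drops.append(base - score(image, edited))
    return drops


def cpi_score(image: Image, caption: str, phrases: Sequence[str],
              score: Scorer) -> float:
    """Mean phrase sensitivity (the aggregation rule is an assumption)."""
    drops = phrase_sensitivities(image, caption, phrases, score)
    return sum(drops) / len(drops) if drops else 0.0


def select(pairs: Sequence[Pair], score: Scorer, align_threshold: float,
           keep_fraction: float = 0.5) -> List[Pair]:
    """Stage 1: drop coarse mismatches. Stage 2: rank survivors by sensitivity."""
    survivors = [p for p in pairs if score(p[0], p[1]) >= align_threshold]
    ranked = sorted(survivors,
                    key=lambda p: cpi_score(p[0], p[1], p[2], score),
                    reverse=True)
    k = max(1, int(len(ranked) * keep_fraction)) if ranked else 0
    return ranked[:k]


pairs: List[Pair] = [
    ({"red", "car", "on", "road"}, "red car on road", ["red car", "road"]),
    ({"a", "photo", "of"}, "a photo of dog", ["dog"]),
    ({"cat"}, "stock market news today", ["stock market"]),
]
selected = select(pairs, toy_score, align_threshold=0.5)
```

In this toy pool, the third pair is removed by the coarse alignment filter. The second pair passes that filter, but its only phrase does not change the score when replaced, so it ranks below the first pair. That is the distinction the abstract draws between a pair being broadly plausible and its phrases materially supporting the match.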

Computer Vision | Data Curation & Synthetic Data | Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
