This paper introduces MACCO, a framework that enhances the compositional understanding of vision-language models by masking compositional concepts in one modality and reconstructing them using contextual information from the other modality. The approach addresses the limitations of existing models, which often fail to capture object relations and dependencies due to their reliance on single-vector representations. Experimental results across five compositional benchmarks show that MACCO significantly improves compositionality and the understanding of syntactic structure, and that these gains carry over to applications such as text-to-image generation and multimodal large language models.
Masking compositional concepts in one modality while leveraging contextual cues from the other can significantly enhance the compositionality of vision-language models.
Contrastively trained vision-language models (VLMs) such as CLIP have made remarkable progress in learning joint image-text representations but still face challenges in compositional understanding. They often exhibit "bag-of-words" behavior, struggling to capture object relations, attribute-object bindings, and word-order dependencies. This limitation arises not only from the reliance on global, single-vector representations for optimization, but also from the insufficient exploitation and modeling of the rich compositional information inherently present in paired image-text data. In this work, we propose MACCO (MAsked Compositional Concept MOdeling), a framework that masks compositional concepts in one modality and reconstructs them conditioned on the full contextual information from the other, enabling the model to capture and align cross-modal compositional structures more effectively. To facilitate this process, we introduce two auxiliary objectives that jointly align and regularize masked features both inter-modally and intra-modally. Extensive experiments on five compositional benchmarks, along with in-depth analyses, demonstrate that our approach not only significantly enhances compositionality in VLMs but also improves their ability to capture syntactic structure and linguistic information. The improved compositionality also benefits text-to-image generation and multimodal large language models. Code is available at https://github.com/hiker-lw/MACCO.
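To make the "bag-of-words" failure mode concrete, the following is a minimal toy sketch, not CLIP and not the MACCO method. It assumes a hypothetical hash-based word embedding (`word_vector`) and compares an order-invariant pooled caption vector with a position-aware sequence of vectors. The function names and the embedding size are illustrative assumptions only.

```python
import hashlib

DIM = 8  # toy embedding size (arbitrary assumption)


def word_vector(word):
    """Hypothetical deterministic embedding: the first DIM bytes of a SHA-256 hash."""
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    return list(digest[:DIM])  # integers in 0..255


def pooled_embedding(caption):
    """Sum word vectors into one global vector; word order is discarded."""
    vectors = [word_vector(w) for w in caption.lower().split()]
    return tuple(sum(column) for column in zip(*vectors))


def sequence_embedding(caption):
    """Keep one vector per position, so word order is preserved."""
    return tuple(tuple(word_vector(w)) for w in caption.lower().split())


caption_a = "the dog chases the cat"
caption_b = "the cat chases the dog"

# Same words in a different order: the pooled vectors collide,
# while the position-aware sequences stay distinct.
same_pooled = pooled_embedding(caption_a) == pooled_embedding(caption_b)
same_sequence = sequence_embedding(caption_a) == sequence_embedding(caption_b)
print(same_pooled, same_sequence)
```

The two captions describe opposite relations, yet the pooled representation cannot tell them apart. A real contrastive encoder is far more expressive than this toy, but the sketch shows why a single global vector can be insensitive to relations and word order.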