Apr 23, 2026 | arXiv:2604.21806

TEMA: Anchor the Image, Follow the Text for Multi-Modification Composed Image Retrieval

Zixu Li, Yupeng Hu, Zhiheng Fu, Zhiwei Chen, Yongqi Li, Liqiang Nie

AI Summary

This paper introduces two new datasets, M-FashionIQ and M-CIRR, to address a limitation of existing Composed Image Retrieval (CIR) datasets: they lack complex, multi-modification instructions. To handle such queries, the authors propose TEMA, a Text-oriented Entity Mapping Architecture designed to process both simple and multi-modification text instructions alongside reference images. Experiments on four benchmark datasets show that TEMA achieves higher retrieval accuracy than existing methods while remaining computationally efficient.

Key Contribution

TEMA targets multi-modification composed image retrieval: it handles instructions that specify several changes at once rather than a single salient change, and the paper reports that it outperforms existing methods on the new M-FashionIQ and M-CIRR datasets.

Abstract

Composed Image Retrieval (CIR) is an important image retrieval paradigm that enables users to retrieve a target image using a multimodal query that consists of a reference image and modification text. Although research on CIR has made significant progress, prevailing setups still rely on simple modification texts that typically cover only a limited range of salient changes, which induces two limitations highly relevant to practical applications, namely Insufficient Entity Coverage and Clause-Entity Misalignment. To address these issues and bring CIR closer to real-world use, we construct two instruction-rich multi-modification datasets, M-FashionIQ and M-CIRR. In addition, we propose TEMA, the Text-oriented Entity Mapping Architecture, which is the first CIR framework designed for multi-modification while also accommodating simple modifications. Extensive experiments on four benchmark datasets demonstrate TEMA's superiority in both original and multi-modification scenarios, while maintaining an optimal balance between retrieval accuracy and computational efficiency. Our code and constructed multi-modification datasets (M-FashionIQ and M-CIRR) are available at https://github.com/lee-zixu/ACL26-TEMA/.
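
To make the CIR setup in the abstract concrete, here is a minimal sketch of the generic retrieval loop: a reference-image embedding and a modification-text embedding are fused into one query vector, and gallery images are ranked by cosine similarity to it. This is not TEMA's architecture, which the abstract does not detail. The additive fusion, the function names (compose_query, rank_candidates), the text_weight parameter, and the toy 3-dimensional embeddings are all assumptions chosen for illustration.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def compose_query(image_emb, text_emb, text_weight=0.5):
    """Fuse reference-image and modification-text embeddings.

    Assumption: simple weighted addition stands in for a learned
    composition module; real CIR models use something richer.
    """
    img = normalize(image_emb)
    txt = normalize(text_emb)
    mixed = [(1 - text_weight) * i + text_weight * t for i, t in zip(img, txt)]
    return normalize(mixed)


def rank_candidates(query, gallery):
    """Return gallery keys sorted by cosine similarity to the query, best first."""
    scores = {
        name: sum(q * c for q, c in zip(query, normalize(emb)))
        for name, emb in gallery.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy embeddings (hypothetical): axis 0 ~ reference content, axis 1 ~ requested change.
reference_image = [1.0, 0.0, 0.0]
modification_text = [0.0, 1.0, 0.0]
gallery = {
    "target": [1.0, 1.0, 0.0],          # keeps the reference content and applies the change
    "reference_like": [1.0, 0.0, 0.0],  # same content, change not applied
    "unrelated": [0.0, 0.0, 1.0],
}

query = compose_query(reference_image, modification_text)
ranking = rank_candidates(query, gallery)
print(ranking)
```

In this toy setting the candidate that both preserves the reference content and reflects the text ranks first. A multi-modification instruction would correspond to a text embedding that has to encode several such changes at once, which is the case the paper argues simple modification texts do not cover.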

Computer Vision, Multimodal Models, Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 87

Year: 2026

Venue: N/A
