BIGAI | WHU | Apr 22, 2026 | arXiv:2604.20429

Fast-then-Fine: A Two-Stage Framework with Multi-Granular Representation for Cross-Modal Retrieval in Remote Sensing

Xi Chen, Xiangyang Jia, Xu Zhang, Shuquan Wei, Wei Wang

AI Summary

This paper introduces a "fast-then-fine" (FTF) two-stage framework for remote sensing image-text retrieval, designed to improve both efficiency and accuracy. The FTF framework first uses text-agnostic coarse-grained representations for efficient candidate selection, then employs a parameter-free balanced text-guided interaction block to enhance fine-grained alignment in a reranking stage. Experiments on public benchmarks show that FTF achieves competitive retrieval accuracy with significantly improved retrieval efficiency compared to existing methods.

Key Contribution

Achieves competitive remote sensing image-text retrieval accuracy with significantly better retrieval efficiency, and without the computational burden of large-scale vision-language model pre-training, by splitting retrieval into a fast recall stage and a fine rerank stage.

Abstract

Remote sensing (RS) image-text retrieval plays a critical role in understanding massive RS imagery. However, the dense multi-object distribution and complex backgrounds in RS imagery make it difficult to simultaneously achieve fine-grained cross-modal alignment and efficient retrieval. Existing methods either rely on complex cross-modal interactions that lead to low retrieval efficiency, or depend on large-scale vision-language model pre-training, which requires massive data and computational resources. To address these issues, we propose a fast-then-fine (FTF) two-stage retrieval framework that decomposes retrieval into a text-agnostic recall stage for efficient candidate selection and a text-guided rerank stage for fine-grained alignment. Specifically, in the recall stage, text-agnostic coarse-grained representations are employed for efficient candidate selection; in the rerank stage, a parameter-free balanced text-guided interaction block enhances fine-grained alignment without introducing additional learnable parameters. Furthermore, an inter- and intra-modal loss is designed to jointly optimize cross-modal alignment across multi-granular representations. Extensive experiments on public benchmarks demonstrate that FTF achieves competitive retrieval accuracy while significantly improving retrieval efficiency compared with existing methods.
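The recall-then-rerank decomposition described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the softmax attention used as a stand-in for the parameter-free text-guided interaction block, and the `temperature` and `k` values are all assumptions made for illustration. The point it shows is that stage 1 scores every image using vectors that do not depend on the query (so they can be precomputed), while the costlier text-guided scoring in stage 2 runs only on the top-k candidates.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    # Guard against zero vectors by falling back to a norm of 1.
    n = math.sqrt(dot(v, v)) or 1.0
    return [x / n for x in v]


def recall(query_vec, global_vecs, k):
    """Stage 1: rank all images by cosine similarity of coarse global vectors.

    The image vectors are text-agnostic, so they can be computed offline.
    """
    q = normalize(query_vec)
    scores = [dot(q, normalize(g)) for g in global_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


def text_guided_score(word_vecs, region_vecs, temperature=0.1):
    """Stage 2 (hypothetical stand-in): parameter-free text-guided attention.

    Each word attends over image regions with a softmax of cosine
    similarities; the score is the mean cosine similarity between each word
    and its attended region vector. No learnable weights are involved.
    """
    regions = [normalize(r) for r in region_vecs]
    total = 0.0
    for w in word_vecs:
        w = normalize(w)
        sims = [dot(w, r) for r in regions]
        m = max(sims)  # subtract the max for numerical stability
        weights = [math.exp((s - m) / temperature) for s in sims]
        z = sum(weights)
        attended = [
            sum(wt / z * r[d] for wt, r in zip(weights, regions))
            for d in range(len(w))
        ]
        total += dot(w, normalize(attended))
    return total / len(word_vecs)


def fast_then_fine(query_vec, word_vecs, global_vecs, regions_per_image, k=2):
    """Recall k candidates cheaply, then rerank only those candidates."""
    candidates = recall(query_vec, global_vecs, k)
    return sorted(
        candidates,
        key=lambda i: text_guided_score(word_vecs, regions_per_image[i]),
        reverse=True,
    )


# Toy data: three images with 2-d global vectors and two regions each.
global_vecs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
regions_per_image = [
    [[0.0, 1.0], [0.0, 1.0]],  # image 0: no region matches the word
    [[1.0, 0.0], [0.0, 1.0]],  # image 1: one region matches the word
    [[0.0, 1.0], [0.0, 1.0]],  # image 2: filtered out in stage 1
]
query_vec = [1.0, 0.0]    # coarse sentence-level vector
word_vecs = [[1.0, 0.0]]  # fine-grained word-level vectors

ranking = fast_then_fine(query_vec, word_vecs, global_vecs, regions_per_image)
```

In this toy example, stage 1 keeps images 0 and 1, and stage 2 promotes image 1 because one of its regions aligns with the query word. Image 2 is never scored by the fine-grained stage, which is where the efficiency gain in such a design would come from.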

Computer Vision, Multimodal Models, Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
