ZJU

Jun 1, 2026

arXiv:2606.01825

ROGLE: Robust Global-Local Alignment with Automated Region Supervision for Text-Based Person Search

Zequn Xie, Xibei Jia, Sihang Cai, Shulei Wang, Tao Jin

AI Summary

This paper introduces ROGLE, a framework for Text-Based Person Search (TBPS) that improves fine-grained alignment through an automated Region-to-Sentence Matching (RSM) strategy. RSM generates pseudo region-sentence pairs, which reduces reliance on manual annotations. ROGLE combines global contrastive learning with region-level local alignment in a multi-granular approach, targeting the global representational bias of existing TBPS models, particularly those based on CLIP. The authors also present the P-VLG Benchmark, a dataset with over 100,000 annotated regions and long-form captions that supports both global and local evaluation. In their experiments, ROGLE significantly outperforms prior methods, especially on complex long-form queries.

Key Contribution

ROGLE automatically generates fine-grained, region-level supervision for Text-Based Person Search and outperforms existing models on challenging long-form queries.

Abstract

Text-Based Person Search (TBPS) aims to retrieve pedestrian images using natural language queries. However, existing TBPS models, especially those based on CLIP, struggle with fine-grained understanding due to global representational bias and semantic sparsity inherited from training on short captions. This results in weak fine-grained alignment, exacerbated by the scarcity of region-level annotations. To address this, we propose ROGLE (Robust Global-Local Embedding), a unified framework that overcomes reliance on costly manual annotations through an automated Region-to-Sentence Matching (RSM) strategy. RSM automatically mines pseudo region-sentence pairs for scalable fine-grained supervision. Furthermore, ROGLE employs a multi-granular learning strategy that fuses global contrastive learning with region-level local alignment. We also introduce the P-VLG Benchmark, a large-scale dataset constructed by curating and enriching images from established public benchmarks. It features over 100,000 annotated regions and rich long-form captions, making it the first TBPS benchmark to support both global and local assessment protocols. Extensive experiments show that ROGLE significantly outperforms existing approaches, particularly on challenging long-form queries. Code and the P-VLG benchmark will be made publicly available.
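The abstract describes two ideas that a small example may make more concrete: mining pseudo region-sentence pairs, and combining a global contrastive loss with a region-level local loss. The following is a minimal sketch under stated assumptions, not the authors' implementation. The function names (`match_regions_to_sentences`, `info_nce`, `multi_granular_loss`), the cosine-similarity threshold, the temperature, and the `local_weight` parameter are all hypothetical choices made for illustration; the abstract does not specify how RSM scores pairs or how the two loss terms are weighted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def match_regions_to_sentences(region_embs, sentence_embs, threshold=0.5):
    """Hypothetical stand-in for RSM: pair each sentence with its most
    similar region and keep the pair only if similarity >= threshold."""
    pairs = []
    for s_idx, sentence in enumerate(sentence_embs):
        sims = [cosine(region, sentence) for region in region_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        if sims[best] >= threshold:
            pairs.append((best, s_idx))  # (region index, sentence index)
    return pairs

def info_nce(queries, keys, temperature=0.07):
    """One-directional InfoNCE: the positive for queries[i] is keys[i];
    every other key in the batch acts as a negative."""
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine(query, key) / temperature for key in keys]
        top = max(logits)  # subtract the max for numerical stability
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_denom - logits[i]
    return total / len(queries)

def multi_granular_loss(img_embs, txt_embs, region_embs, sentence_embs,
                        pairs, local_weight=1.0):
    """Assumed combination: symmetric global loss over image-caption pairs
    plus a weighted symmetric local loss over mined region-sentence pairs."""
    global_loss = 0.5 * (info_nce(img_embs, txt_embs) + info_nce(txt_embs, img_embs))
    if not pairs:
        return global_loss
    regions = [region_embs[r] for r, _ in pairs]
    sentences = [sentence_embs[s] for _, s in pairs]
    local_loss = 0.5 * (info_nce(regions, sentences) + info_nce(sentences, regions))
    return global_loss + local_weight * local_loss

# Toy data: three regions and three sentences in a 3-d embedding space.
region_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
sentence_embs = [[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.5, 0.5, 0.5]]
pairs = match_regions_to_sentences(region_embs, sentence_embs, threshold=0.8)

img_embs = [[1.0, 0.0], [0.0, 1.0]]
txt_embs = [[0.9, 0.1], [0.2, 0.8]]
loss = multi_granular_loss(img_embs, txt_embs, region_embs, sentence_embs, pairs)
```

In this toy setup the third sentence is equally similar to every region and falls below the threshold, so it contributes no local supervision. A real pipeline would also need to handle several sentences mapping to the same region, which this sketch does not deduplicate.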

Multimodal Models, Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
