Apr 23, 2026 · arXiv:2604.21324

Temporal Prototyping and Hierarchical Alignment for Unsupervised Video-based Visible-Infrared Person Re-Identification

Zhiyong Li, Wei Jiang, Haojie Liu, Mingyu Wang, Wanchong Xu, Wei Mao

AI Summary

This paper tackles unsupervised video-based visible-infrared person re-identification (VI-ReID) with a framework called HiTPro (Hierarchical Temporal Prototyping). HiTPro constructs intra-camera prototypes from temporally partitioned sub-tracklets, then performs hierarchical cross-prototype alignment, using dynamic thresholding and soft weight assignment to mine positive pairs. The framework is trained with a hierarchical contrastive learning objective across three levels (intra-camera, cross-camera same-modality, and cross-modality) and achieves state-of-the-art results on the HITSZ-VCM and BUPTCampus datasets.

Key Contribution

HiTPro addresses unsupervised video-based visible-infrared person re-identification without explicit hard pseudo-label assignment. Its hierarchical temporal prototyping approach significantly outperforms adapted baselines on HITSZ-VCM and BUPTCampus under fully unsupervised settings.

Abstract

Visible-infrared person re-identification (VI-ReID) enables cross-modality identity matching for all-day surveillance, yet existing methods predominantly focus on the image level or rely heavily on costly identity annotations. While video-based VI-ReID has recently emerged to exploit temporal dynamics for improved robustness, existing studies remain limited to supervised settings. Crucially, the unsupervised video VI-ReID problem, where models must learn from RGB and infrared tracklets without identity labels, remains largely unexplored despite its practical importance in real-world deployment. To bridge this gap, we propose HiTPro (Hierarchical Temporal Prototyping), a prototype-driven framework without explicit hard pseudo-label assignment for unsupervised video-based VI-ReID. HiTPro begins with an efficient Temporal-aware Feature Encoder that first extracts discriminative frame-level features and then aggregates them into a robust tracklet-level representation. Building upon these features, HiTPro constructs reliable intra-camera prototypes via Intra-Camera Tracklet Prototyping by aggregating features from temporally partitioned sub-tracklets. Through Hierarchical Cross-Prototype Alignment, we perform a two-stage positive mining process, progressing from within-modality associations to cross-modality matching, enhanced by a Dynamic Threshold Strategy and Soft Weight Assignment. Finally, Hierarchical Contrastive Learning progressively optimizes feature-prototype alignment across three levels: intra-camera discrimination, cross-camera same-modality consistency, and cross-modality invariance. Extensive experiments on HITSZ-VCM and BUPTCampus demonstrate that HiTPro achieves state-of-the-art performance under fully unsupervised settings, significantly outperforming adapted baselines and establishing a strong baseline for future research.
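
The abstract does not give formulas for the prototyping or positive-mining steps, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a prototype is the normalized mean of per-part means over contiguous temporal segments, that the "dynamic threshold" is a margin below the best similarity, and that soft weights are a softmax over the surviving candidates. The function names and the `num_parts`, `margin`, and `temperature` parameters are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def subtracklet_prototype(frame_features, num_parts=3):
    """Assumed prototyping rule: split the tracklet into contiguous temporal
    parts, average each part, then average the normalized part means."""
    n = len(frame_features)
    num_parts = max(1, min(num_parts, n))
    # Integer boundaries keep every part non-empty when num_parts <= n.
    bounds = [i * n // num_parts for i in range(num_parts + 1)]
    part_means = [
        l2_normalize(mean_vector(frame_features[bounds[i]:bounds[i + 1]]))
        for i in range(num_parts)
    ]
    return l2_normalize(mean_vector(part_means))


def soft_positive_weights(query, prototypes, margin=0.1, temperature=0.1):
    """Assumed mining rule: keep prototypes whose cosine similarity is within
    `margin` of the best match, then weight the survivors with a softmax.
    Inputs are expected to be L2-normalized; `prototypes` must be non-empty."""
    sims = [sum(q * p for q, p in zip(query, proto)) for proto in prototypes]
    threshold = max(sims) - margin  # hypothetical dynamic threshold
    scores = [math.exp(s / temperature) if s >= threshold else 0.0 for s in sims]
    total = sum(scores)  # > 0 because the best match always passes
    return [s / total for s in scores]
```

Under these assumptions, no candidate is ever assigned a hard label: a query tracklet contributes to the loss through every prototype that clears the threshold, in proportion to its weight.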

Computer Vision · Multimodal Models

Citation Metrics

Citations: 0

Influential citations: 0

References: 66

Year: 2026

Venue: N/A
