Tsinghua AI | Apr 16, 2026 | arXiv:2604.15090

Beyond Visual Cues: Semantic-Driven Token Filtering and Expert Routing for Anytime Person ReID

AI Summary

This paper introduces Semantic-driven Token Filtering and Expert Routing (STFER) for Any-Time Person Re-identification (AT-ReID), addressing the challenge of performance degradation due to modality shifts and clothing changes. STFER leverages Large Vision-Language Models (LVLMs) to generate identity-consistent text descriptions, which are then used to filter visual tokens and guide expert routing, enhancing robustness to visual variations. Experiments on the AT-USTC dataset and other ReID benchmarks demonstrate state-of-the-art results and superior generalization capabilities.

Key Contribution

Instead of relying on fickle visual cues, this ReID method uses language to describe *who* a person is, not just what they look like, and it reports state-of-the-art results on the AT-USTC dataset.

Abstract

Any-Time Person Re-identification (AT-ReID) requires robust retrieval of target individuals under arbitrary conditions, encompassing both modality shifts (daytime and nighttime) and clothing-change scenarios over short-term to long-term intervals. However, existing methods rely heavily on purely visual features, which change with environmental and temporal factors, resulting in significant performance deterioration under illumination-induced modality shifts or clothing changes. In this paper, we propose Semantic-driven Token Filtering and Expert Routing (STFER), a novel framework that leverages the ability of Large Vision-Language Models (LVLMs) to generate identity-consistent text, which provides identity-discriminative features that are robust to both clothing variations and cross-modality shifts between RGB and IR. Specifically, we employ instructions to guide the LVLM in generating identity-intrinsic semantic text that captures biometric constants for semantic-driven modeling. The text token is then used for Semantic-driven Visual Token Filtering (SVTF), which enhances informative visual regions and suppresses redundant background noise. The text token is also used for Semantic-driven Expert Routing (SER), which integrates the semantic text into expert routing, resulting in more robust multi-scenario gating. Extensive experiments on the Any-Time ReID dataset (AT-USTC) demonstrate that our model achieves state-of-the-art results. Moreover, the model trained on AT-USTC was evaluated on five widely used ReID benchmarks, demonstrating superior generalization with highly competitive results. Our code will be available soon.
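
The abstract does not spell out how SVTF and SER are computed, so the following is only a minimal sketch of the two ideas under stated assumptions, not the authors' implementation. It assumes that filtering keeps the visual tokens most cosine-similar to a single text token, and that routing applies a softmax gate to the concatenation of a pooled visual feature and the text token. All function names, the `keep_ratio` parameter, and the toy gate weights are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def filter_visual_tokens(visual_tokens, text_token, keep_ratio=0.5):
    """Assumed filtering rule: keep the top-k visual tokens by similarity to the text token.

    Returns the kept indices (in original order) and the kept tokens.
    """
    k = max(1, math.ceil(keep_ratio * len(visual_tokens)))
    scores = [cosine(v, text_token) for v in visual_tokens]
    top = sorted(range(len(visual_tokens)), key=lambda i: scores[i], reverse=True)[:k]
    kept = sorted(top)  # restore the original token order
    return kept, [visual_tokens[i] for i in kept]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_experts(visual_feature, text_token, gate_weights):
    """Assumed gating rule: linear gate over [visual_feature; text_token], then softmax.

    gate_weights holds one weight vector per expert.
    """
    gate_input = list(visual_feature) + list(text_token)
    logits = [sum(w * x for w, x in zip(row, gate_input)) for row in gate_weights]
    return softmax(logits)


# Toy example: four 2-D visual tokens and one 2-D text token.
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
text_token = [1.0, 0.0]

kept_idx, kept_tokens = filter_visual_tokens(visual_tokens, text_token, keep_ratio=0.5)

# Mean-pool the kept tokens into a single visual feature.
pooled = [sum(col) / len(kept_tokens) for col in zip(*kept_tokens)]

# Two hypothetical experts, each with a weight vector over [visual; text].
gate_weights = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
expert_weights = route_experts(pooled, text_token, gate_weights)

print(kept_idx)
print(expert_weights)
```

In this toy setup the two tokens aligned with the text token survive filtering, and the gate favors the first expert because both the pooled visual feature and the text token point along its weight vector. A real system would use learned projections and a trained gate rather than these hand-set values.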

Computer Vision | Multimodal Models | Recommendation & Information Retrieval

Citation Metrics

Citations: 0

Influential citations: 0

References: 64

Year: 2026

Venue: N/A
