Jun 18, 2026 · arXiv:2606.20044

FUSE: Frequency-domain Unification and Spectral Energy Alignment for Multi-modal Object Re-Identification

Xuanhao Qi, Tom H. Luan, Yukang Zhang, Jinkai Zheng, Zhou Su, Shuwei Li, Lei Tan

AI Summary

This paper introduces FUSE, a frequency-domain framework for multi-modal object re-identification (ReID). It targets a limitation of existing methods, which emphasize low-frequency cues, by also modeling mid- and high-frequency features. FUSE works in two stages, spectral disentanglement and energy alignment: a Spectral Decomposition Module partitions features into frequency subspaces, and a Cross-Modal Alignment Module aligns spectral energy across modalities to stabilize cross-modal alignment. Experiments on RGBNT201, RGBNT100, and MSVR310 report improvements of 9.1% in mean Average Precision (mAP) and 9.5% in Rank-1 accuracy.

Key Contribution

FUSE reports a 9.1% improvement in mAP and a 9.5% improvement in Rank-1 accuracy by leveraging mid- and high-frequency features, challenging the conventional focus on low-frequency cues in multi-modal ReID.

Abstract

Despite significant progress in multi-modal Re-Identification (ReID), existing methods tend to emphasize low-frequency cues. Consequently, they focus on attributes such as color, illumination, and coarse appearance, while overlooking mid- and high-frequency structures that encode geometric, textural, and identity-discriminative details. This imbalance leads to incomplete spectral representations and unstable cross-modal alignment. To overcome these limitations, we introduce FUSE, a frequency-domain framework that reformulates multi-modal ReID as a two-stage process of spectral disentanglement and energy alignment. The proposed Spectral Decomposition Module (SDM) adaptively partitions features into low-, mid-, and high-frequency subspaces, enabling hierarchical spectral modeling. The Cross-Modal Alignment Module (CAM) further enforces energy alignment and subspace complementarity across modalities via frequency-consistency regularization. In addition, FUSE incorporates learnable frequency modulation to enhance robustness under varying illumination and heterogeneous sensor conditions. Extensive experiments on RGBNT201, RGBNT100, and MSVR310 show that FUSE achieves 9.1% mAP and 9.5% Rank-1 improvements, establishing an interpretable frequency-domain paradigm for multi-modal representation learning.
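The two stages described in the abstract, splitting a feature into frequency bands and then comparing band energies across modalities, can be illustrated with a small sketch. This is not the authors' implementation: the 1-D signal, the fixed cutoffs `low_cut` and `mid_cut` (standing in for the adaptive, learned partition), and the function names `split_bands`, `band_energy_share`, and `energy_alignment_loss` are all assumptions made for illustration.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(coeffs):
    """Inverse DFT; returns the real part (inputs here are conjugate-symmetric)."""
    n = len(coeffs)
    return [sum(coeffs[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)).real / n
            for t in range(n)]


def split_bands(x, low_cut=0.15, mid_cut=0.35):
    """Split a 1-D signal into low/mid/high components.

    Cutoffs are normalized frequencies in [0, 0.5]. They are fixed here
    (an assumption); the paper describes an adaptive partition instead.
    """
    n = len(x)
    bands = {"low": [0j] * n, "mid": [0j] * n, "high": [0j] * n}
    for k, c in enumerate(dft(x)):
        f = min(k, n - k) / n  # fold negative frequencies onto [0, 0.5]
        name = "low" if f < low_cut else "mid" if f < mid_cut else "high"
        bands[name][k] = c
    return {name: idft(coeffs) for name, coeffs in bands.items()}


def band_energy_share(x):
    """Fraction of total signal energy that falls in each band."""
    parts = split_bands(x)
    energy = {name: sum(v * v for v in sig) for name, sig in parts.items()}
    total = sum(energy.values()) or 1.0
    return {name: e / total for name, e in energy.items()}


def energy_alignment_loss(x_a, x_b):
    """Squared difference between the band-energy profiles of two signals.

    A hypothetical stand-in for a frequency-consistency regularizer.
    """
    share_a, share_b = band_energy_share(x_a), band_energy_share(x_b)
    return sum((share_a[k] - share_b[k]) ** 2 for k in share_a)
```

Because every frequency bin is assigned to exactly one band, the three components sum back to the original signal, and their energies add up to the total energy. Two modalities whose energy is distributed differently across the bands (for example, one dominated by low frequencies and one with strong high-frequency texture) produce a positive loss, while identical spectral profiles produce zero.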

Computer Vision Multimodal Models

Year: 2026

Venue: N/A
