Tsinghua AI | School of Artificial Intelligence | ZJU | Apr 13, 2026 | arXiv:2604.11234

Bridging the RGB-IR Gap: Consensus and Discrepancy Modeling for Text-Guided Multispectral Detection

Jiaqi Wu, Zhen Wang, Enhao Huang, Kangqing Shen, Yulin Wang, Yifan Pu, Gao Huang

AI Summary

This paper introduces a semantic bridge fusion framework with bi-support modeling to improve text-guided multispectral object detection by explicitly addressing the granularity asymmetry between RGB and IR data. The framework uses text as a shared semantic bridge to align RGB and IR responses and projects a recalibrated thermal semantic prior onto the RGB branch for semantic-level mapping fusion. It also models RGB-IR interaction evidence as both consensus and discrepancy support, dynamically recalibrating them during fusion to capture discriminative cues.

Key Contribution

By explicitly modeling both consensus and discrepancy between RGB and IR responses, and using text as a shared semantic bridge between the two modalities, the proposed detector is reported to achieve superior detection performance on multispectral benchmarks.

Abstract

Text-guided multispectral object detection uses text semantics to guide semantic-aware cross-modal interaction between RGB and IR for more robust perception. However, notable limitations remain: (1) existing methods often use text only as an auxiliary semantic enhancement signal, without exploiting its guiding role to bridge the inherent granularity asymmetry between RGB and IR; and (2) conventional data-driven attention-based fusion tends to emphasize stable consensus while overlooking potentially valuable cross-modal discrepancies. To address these issues, we propose a semantic bridge fusion framework with bi-support modeling for multispectral object detection. Specifically, text is used as a shared semantic bridge to align RGB and IR responses under a unified category condition, while the recalibrated thermal semantic prior is projected onto the RGB branch for semantic-level mapping fusion. We further formulate RGB-IR interaction evidence into the regular consensus support and the complementary discrepancy support that contains potentially discriminative cues, and introduce them into fusion via dynamic recalibration as a structured inductive bias. In addition, we design a bidirectional semantic alignment module for closed-loop vision-text guidance enhancement. Extensive experiments demonstrate the effectiveness of the proposed fusion framework and its superior detection performance on multispectral benchmarks. Code is available at https://github.com/zhenwang5372/Bridging-RGB-IR-Gap.
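The bi-support idea in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation (see the linked repository for that). It assumes that RGB features, IR features, and a category text embedding already live in one shared vector space. The function name `bi_support_fuse`, the weights `w_consensus` and `w_discrepancy`, and the specific formulas (cosine responses, minimum for consensus, absolute difference for discrepancy, sigmoid gating) are all hypothetical choices made for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def bi_support_fuse(rgb_feats, ir_feats, text_emb,
                    w_consensus=1.0, w_discrepancy=1.0):
    """Toy fusion of per-location RGB and IR features (hypothetical).

    rgb_feats, ir_feats: lists of feature vectors, one per spatial location.
    text_emb: category text embedding, assumed to share the feature space.
    Returns (fused features, list of (consensus, discrepancy) pairs).
    """
    fused, supports = [], []
    for rgb, ir in zip(rgb_feats, ir_feats):
        # Text acts as the shared reference: score each modality against it.
        s_rgb = cosine(rgb, text_emb)
        s_ir = cosine(ir, text_emb)

        # Consensus support: both modalities respond to the category.
        consensus = max(0.0, min(s_rgb, s_ir))
        # Discrepancy support: the modalities disagree about the category.
        discrepancy = abs(s_rgb - s_ir)

        # Recalibration gate driven by both supports (assumed form).
        gate = sigmoid(w_consensus * consensus + w_discrepancy * discrepancy)
        # Share of the fused feature taken from IR, favoring whichever
        # modality responds more strongly to the text.
        alpha = sigmoid(s_ir - s_rgb)

        fused.append([gate * ((1.0 - alpha) * a + alpha * b)
                      for a, b in zip(rgb, ir)])
        supports.append((consensus, discrepancy))
    return fused, supports


# Two locations: the modalities agree at the first and disagree at the second.
text = [1.0, 0.0]
rgb = [[1.0, 0.0], [1.0, 0.0]]
ir = [[1.0, 0.0], [0.0, 1.0]]
fused, supports = bi_support_fuse(rgb, ir, text)
for location, (c, d) in enumerate(supports):
    print(f"location {location}: consensus={c:.3f}, discrepancy={d:.3f}")
```

In this toy setting the first location yields high consensus and near-zero discrepancy, while the second yields the reverse. A fusion rule driven only by agreement would suppress the second location, whereas keeping discrepancy as a separate support lets the gate retain it.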

Computer Vision, Multimodal Models, Natural Language Processing

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
