Feb 25, 2026 · arXiv:2602.22033

RT-RMOT: A Dataset and Framework for RGB-Thermal Referring Multi-Object Tracking

Zhifan Jin, Sijia Chen, Tongfei Chu, Liman Liu

AI Summary

The paper introduces a new RGB-Thermal Referring Multi-Object Tracking (RT-RMOT) task to address limitations of existing RMOT methods in low-visibility conditions. To facilitate research in this area, the authors create RefRT, the first RGB-Thermal RMOT dataset, comprising 388 language descriptions, 1,250 tracked targets, and 166,147 L-RGB-T triplets. They also propose RTrack, a multimodal large language model-based framework, and demonstrate its effectiveness on the RefRT dataset, further enhancing it with Group Sequence Policy Optimization (GSPO) and Clipped Advantage Scaling (CAS) strategies.

Key Contribution

Enables language-guided tracking of multiple objects in low-visibility conditions through RefRT, the first RGB-Thermal RMOT dataset, and RTrack, an MLLM-based framework.

Abstract

Referring Multi-Object Tracking (RMOT) has attracted increasing attention due to its human-friendly interactive characteristics, yet it exhibits limitations in low-visibility conditions such as nighttime and smoke. To overcome this limitation, we propose a new RGB-Thermal RMOT task, named RT-RMOT, which aims to fuse RGB appearance features with the illumination robustness of the thermal modality to enable all-day referring multi-object tracking. To promote research on RT-RMOT, we construct the first Referring Multi-Object Tracking dataset for the RGB-Thermal modality, named RefRT. It contains 388 language descriptions, 1,250 tracked targets, and 166,147 Language-RGB-Thermal (L-RGB-T) triplets. Furthermore, we propose RTrack, a framework built upon a multimodal large language model (MLLM) that integrates RGB, thermal, and textual features. Since the initial framework still leaves room for improvement, we introduce a Group Sequence Policy Optimization (GSPO) strategy to further exploit the model's potential. To alleviate training instability during RL fine-tuning, we introduce a Clipped Advantage Scaling (CAS) strategy to suppress gradient explosion. In addition, we design a Structured Output Reward and a Comprehensive Detection Reward to balance exploration and exploitation, thereby improving the completeness and accuracy of target perception. Extensive experiments on the RefRT dataset demonstrate the effectiveness of the proposed RTrack framework.
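As a rough illustration of the RL fine-tuning idea mentioned in the abstract, the sketch below normalizes rewards within a group of sampled responses, clamps the resulting advantages to a fixed range, and plugs them into a sequence-level clipped surrogate in the style of GSPO. The abstract does not give any formulas, so the clamp bound `adv_clip`, the ratio clip `eps`, and all function names are assumptions made for illustration; this is not the authors' CAS definition or reward design.

```python
import math
from statistics import mean, pstdev


def clipped_group_advantages(rewards, adv_clip=2.0, tiny=1e-6):
    """Normalize rewards within one group of sampled responses, then clamp.

    adv_clip is a hypothetical bound, not a value taken from the paper.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    advantages = [(r - mu) / (sigma + tiny) for r in rewards]
    # Clamping keeps a single outlier reward from producing a huge update.
    return [max(-adv_clip, min(adv_clip, a)) for a in advantages]


def sequence_ratio(new_logprobs, old_logprobs):
    """Length-normalized sequence-level importance ratio (GSPO-style).

    Inputs are per-token log-probabilities of the same response under the
    current policy and the old policy.
    """
    n = len(new_logprobs)
    return math.exp((sum(new_logprobs) - sum(old_logprobs)) / n)


def clipped_surrogate(ratio, advantage, eps=0.2):
    """PPO-style clipped objective applied to one whole sequence."""
    clipped_ratio = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    rewards = [0.0] * 7 + [1.0]  # one response scores far above the rest
    advs = clipped_group_advantages(rewards)
    ratio = sequence_ratio([-0.5, -0.4, -0.6], [-0.6, -0.5, -0.7])
    print(advs[-1], ratio, clipped_surrogate(ratio, advs[-1]))
```

In this toy group the outlier's normalized advantage would exceed the bound, so it is clamped to `adv_clip`. Bounding the advantage in this way is one simple route to limiting gradient magnitude; the paper's actual scaling rule may differ.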

Computer Vision · Data Curation & Synthetic Data · Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
