Manuscript received April 21, 2026

USTC

arXiv:2604.19544

DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling

Zhihong Zhang, Jie Zhao, Xiaojian Huang, Jin Xu, Zhuodong Luo, Xin Liu, Jiansheng Wei, Xuejin Chen

AI Summary

The paper introduces DT2IT-MRM, a method for improving multimodal reward models (MRMs) by addressing issues in existing preference datasets, such as lack of granularity, style bias, and unreliable signals. DT2IT-MRM uses a debiased preference construction pipeline, reformulates text-to-image preference data, and employs an iterative training framework to curate existing datasets. Experiments demonstrate that DT2IT-MRM achieves state-of-the-art performance on VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench, indicating improved alignment of MLLMs with human preferences.

Key Contribution

Noisy multimodal preference datasets are holding back reward model performance, but DT2IT-MRM offers a scalable curation strategy that achieves state-of-the-art results.

Abstract

Multimodal reward models (MRMs) play a crucial role in aligning Multimodal Large Language Models (MLLMs) with human preferences. Training a good MRM requires high-quality multimodal preference data. However, existing preference datasets face three key challenges: lack of granularity in preference strength, textual style bias, and unreliable preference signals. In addition, existing open-source multimodal preference datasets suffer from substantial noise, yet effective and scalable curation methods for improving their quality are lacking. To address these limitations, we propose DT2IT-MRM, which integrates a Debiased preference construction pipeline, a novel reformulation of text-to-image (T2I) preference data, and an Iterative Training framework that curates existing multimodal preference datasets for Multimodal Reward Modeling. Our experimental results show that DT2IT-MRM achieves new state-of-the-art overall performance on three major benchmarks: VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench.

Data Curation & Synthetic Data, Multimodal Models, RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
