Ant Group, HKUST, Liangzhu Laboratory, Shanghai Innovation, WeDoctor Cloud, ZJU | Apr 7, 2026 | arXiv:2604.05445

Learning What Matters: Dynamic Dimension Selection and Aggregation for Interpretable Vision-Language Reward Modeling

Qiyuan Chen, Hongsen Huang, Jiahe Chen, Qian Shao, Jintai Chen, Hongxia Xu, Renjie Hua, Chuan-Ying Ren

AI Summary

The paper introduces VL-MDR, a vision-language reward modeling framework that decomposes reward prediction into interpretable dimensions like hallucination and reasoning, using a visual-aware gating mechanism to select and weight relevant dimensions for each input. To train VL-MDR, the authors curate a dataset of 321k vision-language preference pairs annotated across 21 fine-grained dimensions. Experiments demonstrate that VL-MDR outperforms existing reward models and enables effective DPO alignment to mitigate visual hallucinations in VLMs.
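To make the select-and-weight idea concrete, here is a minimal sketch of how a gated multi-dimensional reward could be aggregated. It is not the authors' implementation: the top-k selection rule, the softmax weighting, the function name `aggregate_reward`, and the placeholder dimension names are all assumptions for illustration. The summary states only that a gate selects and weights the relevant dimensions for each input.

```python
import math

# Only "hallucination" and "reasoning" are named in the source; the other
# entries are placeholders standing in for the remaining dimensions.
DIMENSIONS = ["hallucination", "reasoning", "placeholder_a", "placeholder_b"]


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_reward(gate_logits, dim_scores, top_k=2):
    """Combine per-dimension scores into one reward (assumed scheme).

    gate_logits: one relevance logit per dimension, e.g. from a gate that
                 sees the image and the text (hypothetical input here).
    dim_scores:  one quality score per dimension for the response.
    Returns the scalar reward and a breakdown {dimension: (weight, score)}.
    """
    # Keep the top_k most relevant dimensions according to the gate.
    order = sorted(range(len(gate_logits)),
                   key=lambda i: gate_logits[i], reverse=True)
    selected = order[:top_k]
    # Renormalise the gate over the selected dimensions only.
    weights = softmax([gate_logits[i] for i in selected])
    breakdown = {DIMENSIONS[i]: (w, dim_scores[i])
                 for i, w in zip(selected, weights)}
    reward = sum(w * s for w, s in breakdown.values())
    return reward, breakdown


reward, breakdown = aggregate_reward(
    gate_logits=[2.0, 2.0, -1.0, -3.0],
    dim_scores=[0.2, 0.8, 0.5, 0.9],
)
print(reward, breakdown)
```

The breakdown is what makes the score inspectable: a reader can see which dimensions were selected and how much each contributed, rather than receiving a single opaque scalar.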

Key Contribution

Stop training black-box reward models: VL-MDR offers a transparent alternative that surfaces *why* a VLM response receives a given reward, opening the door to more targeted alignment.

Abstract

Vision-language reward modeling faces a dilemma: generative approaches are interpretable but slow, while discriminative ones are efficient but act as opaque "black boxes." To bridge this gap, we propose VL-MDR (Vision-Language Multi-Dimensional Reward), a framework that dynamically decomposes evaluation into granular, interpretable dimensions. Instead of outputting a monolithic scalar, VL-MDR employs a visual-aware gating mechanism to identify relevant dimensions and adaptively weight them (e.g., Hallucination, Reasoning) for each specific input. To support this, we curate a dataset of 321k vision-language preference pairs annotated across 21 fine-grained dimensions. Extensive experiments show that VL-MDR consistently outperforms existing open-source reward models on benchmarks like VL-RewardBench. Furthermore, we show that VL-MDR-constructed preference pairs effectively enable DPO alignment to mitigate visual hallucinations and improve reliability, providing a scalable solution for VLM alignment.
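The last step of the abstract, building preference pairs from reward-model scores for DPO, can be sketched as follows. This is an illustrative assumption rather than the paper's procedure: the best-versus-worst selection rule, the `min_margin` filter, the function name `build_preference_pair`, and the output key names are hypothetical, and `score` stands in for any reward model that maps a response to a scalar.

```python
from typing import Callable, Dict, List, Optional


def build_preference_pair(
    candidates: List[str],
    score: Callable[[str], float],
    min_margin: float = 0.1,
) -> Optional[Dict[str, object]]:
    """Pick the highest- and lowest-scored responses as a DPO pair.

    Returns None when there are fewer than two candidates or when the
    score gap is below min_margin (an assumed filter that drops pairs
    the reward model cannot clearly separate).
    """
    if len(candidates) < 2:
        return None
    # Score every candidate response once, then sort by score.
    scored = sorted(((score(c), c) for c in candidates), key=lambda t: t[0])
    low_score, rejected = scored[0]
    high_score, chosen = scored[-1]
    margin = high_score - low_score
    if margin < min_margin:
        return None
    return {"chosen": chosen, "rejected": rejected, "margin": margin}


# Toy scorer: a lookup table standing in for a real reward model.
toy_scores = {
    "The image shows two dogs.": 0.9,
    "The image shows two dogs and a cat.": 0.2,
    "There are animals.": 0.5,
}
pair = build_preference_pair(list(toy_scores), toy_scores.get)
print(pair)
```

In practice each pair would also carry the prompt and image it was generated from, so that a DPO trainer can condition on the same input for both responses.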

Interpretability & Mechanistic Interp | Multimodal Models | RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 22

Year: 2026

Venue: N/A
