DAMO | Jun 8, 2026 | arXiv:2606.09076

Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions

Xin Jin, Huanqia Cai, Zhen Li, Zechao Zhan, Dengyang Jiang, Aiming Hao, Yuming Jiang, Chunle Guo, Peng Gao, Ming-Ming Cheng, Steven C. H. Hoi

AI Summary

This paper introduces Z-Reward, a teacher-student framework for reward modeling in text-to-image generation that represents visual preferences as distributions over rubric scores instead of deterministic scalars. The teacher is trained with Group-wise Direct Score Optimization (GDSO) and the student with Reasoning-Internalized Score Distillation (RISD). On the authors' internally annotated evaluation set, the 27B teacher reaches 89.6% human preference accuracy and the 9B student reaches 88.6%. The teacher outperforms SFT, RewardDance, and GRPO, and the student outperforms the OPD baseline. Z-Reward can also serve as a differentiable reward signal for text-to-image optimization, yielding a 41.3% net human-preference improvement over the SFT baseline.

Key Contribution

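Z-Reward represents subjective visual preferences as distributions over rubric scores. On the authors' internally annotated evaluation set, its 27B teacher reaches 89.6% human preference accuracy (above SFT, RewardDance, and GRPO) and its 9B student reaches 88.6% (above the OPD baseline).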

Abstract

Reward models are central to text-to-image post-training, but visual preference is subjective and better represented as a distribution over rubric scores than as a deterministic scalar. Existing scalar, score-token, and pairwise reward models over-compress uncertainty and fine-grained score differences, while reasoning-based generative rewards provide stronger judgments but are costly to deploy and difficult to use as direct optimization signals. We propose Z-Reward, a teacher-student reward modeling framework that decouples reasoning-heavy judgment from efficient reward deployment. The teacher is a large VLM that uses reasoning to infer rubric-aligned score distributions, and is trained with Group-wise Direct Score Optimization (GDSO), which combines policy-gradient rewards from distribution expectations with direct pointwise and pairwise supervision on score distributions and score gaps. The student is trained with Reasoning-Internalized Score Distillation (RISD), which transfers the teacher's reasoning-conditioned score distribution into a compact VLM without requiring explicit reasoning chains at inference time. On our internally annotated evaluation set, the 27B GDSO teacher reaches 89.6% human preference accuracy, outperforming SFT, RewardDance, and GRPO, while the 9B RISD student reaches 88.6%, outperforming the OPD baseline and closely matching the larger teacher. We further show that Z-Reward can serve as a differentiable reward signal for text-to-image optimization, yielding a 41.3% net human-preference improvement over the SFT baseline.
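The idea of scoring with a distribution rather than a single number can be sketched in a few lines. The example below is only an illustration under stated assumptions, not the paper's implementation: the 5-point rubric, the function names, and the use of KL divergence as a distillation loss are all hypothetical choices made here. It shows how an expected score can be read off a distribution over rubric scores, how two images can be compared by that expectation, and how a student distribution could be pulled toward a teacher distribution.

```python
import math

# Hypothetical 5-point rubric; the abstract does not state the actual scale.
RUBRIC_SCORES = [1, 2, 3, 4, 5]


def softmax(logits):
    """Turn raw per-score logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_score(probs, scores=RUBRIC_SCORES):
    """Expectation of the score distribution, usable as a scalar reward."""
    return sum(p * s for p, s in zip(probs, scores))


def prefers_first(probs_a, probs_b):
    """Pairwise preference: image A wins if its expected score is higher."""
    return expected_score(probs_a) > expected_score(probs_b)


def kl_divergence(teacher, student, eps=1e-12):
    """KL(teacher || student). Assumed here as one plausible distillation
    loss; the abstract does not specify the RISD objective."""
    return sum(
        t * math.log((t + eps) / (s + eps))
        for t, s in zip(teacher, student)
        if t > 0
    )


if __name__ == "__main__":
    teacher = softmax([0.1, 0.3, 1.0, 2.0, 1.5])  # made-up logits
    student = softmax([0.0, 0.0, 0.0, 0.0, 0.0])  # untrained, uniform
    print("teacher expected score:", expected_score(teacher))
    print("distillation loss:", kl_divergence(teacher, student))
```

Because the expectation is a weighted sum of probabilities, it stays differentiable with respect to the model outputs, which is consistent with the abstract's claim that the reward can be used as a direct optimization signal.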

Reasoning & Chain-of-Thought | RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
