Apr 13, 2026 · arXiv:2604.11626

RationalRewards: Reasoning Rewards Scale Visual Generation Both Training and Test Time

Haozhe Wang, Cong Wei, Weiming Ren, Jiaming Liu, Fangzhen Lin, Wenhu Chen

AI Summary

This paper introduces RationalRewards, a reward model for visual generation that produces explicit, multi-dimensional critiques before scoring. The critiques serve two purposes: they provide finer-grained rewards for reinforcement learning during training, and they drive a Generate-Critique-Refine loop at test time. The authors train the reward model with Preference-Anchored Rationalization (PARROT), a framework that recovers rationales from preference data via anchored generation, consistency filtering, and distillation. The resulting 8B-parameter model achieves state-of-the-art preference prediction among open-source reward models, and its critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks.

Key Contribution

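RationalRewards can unlock latent image-generation capabilities at test time, without RL fine-tuning or any parameter updates, by critiquing a generator's outputs and turning those critiques into targeted prompt revisions. On several benchmarks this loop matches or exceeds RL-based fine-tuning.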

Abstract

Most reward models for visual generation reduce rich human judgments to a single unexplained score, discarding the reasoning that underlies preference. We show that teaching reward models to produce explicit, multi-dimensional critiques before scoring transforms them from passive evaluators into active optimization tools, improving generators in two complementary ways: at training time, structured rationales provide interpretable, fine-grained rewards for reinforcement learning; at test time, a Generate-Critique-Refine loop turns critiques into targeted prompt revisions that improve outputs without any parameter updates. To train such a reward model without costly rationale annotations, we introduce Preference-Anchored Rationalization (PARROT), a principled framework that recovers high-quality rationales from readily available preference data through anchored generation, consistency filtering, and distillation. The resulting model, RationalRewards (8B), achieves state-of-the-art preference prediction among open-source reward models, competitive with Gemini-2.5-Pro, while using 10-20x less training data than comparable baselines. As an RL reward, it consistently improves text-to-image and image-editing generators beyond scalar alternatives. Most strikingly, its test-time critique-and-refine loop matches or exceeds RL-based fine-tuning on several benchmarks, suggesting that structured reasoning can unlock latent capabilities in existing generators that suboptimal prompts fail to elicit.
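The Generate-Critique-Refine loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the `Critique` structure, the mean-based overall score, the stopping threshold, and the callables `generate`, `critique`, and `refine` are all hypothetical stand-ins chosen for illustration. The only properties taken from the abstract are that critiques are multi-dimensional, that they are turned into prompt revisions, and that no generator parameters are updated.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass
class Critique:
    """Hypothetical critique: per-dimension rationale text and score."""
    notes: Dict[str, str]      # dimension -> textual rationale
    scores: Dict[str, float]   # dimension -> score in [0, 1]

    @property
    def overall(self) -> float:
        # Plain mean over dimensions; this aggregation rule is an assumption.
        return sum(self.scores.values()) / len(self.scores)


def generate_critique_refine(
    prompt: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Critique],
    refine: Callable[[str, Critique], str],
    max_rounds: int = 3,
    target: float = 0.9,
) -> Tuple[str, str, float]:
    """Return (best_prompt, best_output, best_score); no weights are updated."""
    best = (prompt, "", float("-inf"))
    current = prompt
    for _ in range(max_rounds):
        output = generate(current)
        # Judge the output against the ORIGINAL request, not the revised prompt.
        review = critique(prompt, output)
        if review.overall > best[2]:
            best = (current, output, review.overall)
        if review.overall >= target:
            break
        # Turn the critique into a targeted prompt revision.
        current = refine(current, review)
    return best


# Toy stand-ins so the sketch runs end to end (not real models).
def toy_generate(p: str) -> str:
    return f"<image for: {p}>"


def toy_critique(request: str, output: str) -> Critique:
    # Pretend the image is faithful only once the prompt mentions lighting.
    faithful = 1.0 if "soft lighting" in output else 0.4
    note = "ok" if faithful == 1.0 else "lighting unspecified"
    return Critique(
        notes={"faithfulness": note, "quality": "ok"},
        scores={"faithfulness": faithful, "quality": 0.9},
    )


def toy_refine(p: str, review: Critique) -> str:
    if review.scores["faithfulness"] < 1.0:
        return p + ", soft lighting"
    return p


best_prompt, best_output, best_score = generate_critique_refine(
    "a red fox in snow", toy_generate, toy_critique, toy_refine
)
```

In this toy run the first round scores below the threshold, the critique triggers one prompt revision, and the second round stops the loop. In a real setting `generate` would call a text-to-image or image-editing model and `critique` would call the reward model; both interfaces here are assumptions.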

Computer Vision · Interpretability & Mechanistic Interp · Multimodal Models · RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 53

Year: 2026

Venue: N/A
