UW | Jun 2, 2025 | arXiv:2506.01937

RewardBench 2: Advancing Reward Model Evaluation

Saumya Malik, Valentina Pyatkin, Sander Land, Jacob Daniel Morrison, Noah A. Smith, Hanna Hajishirzi, Nathan Lambert

AI Summary

The paper introduces RewardBench 2, a new benchmark for evaluating reward models across multiple skills, built on challenging data derived from new human prompts. It addresses the gap between reward model evaluation scores and the models' effectiveness in downstream tasks by providing a more rigorous, accuracy-based assessment. Benchmark scores correlate strongly with downstream performance in both inference-time scaling and RLHF training, while models score about 20 points lower on average than on the original RewardBench.

Key Contribution

RewardBench 2 serves as a reality check for reward models: they score about 20 points lower on average on its new, human-written prompts than on the original RewardBench, yet their scores on the benchmark are highly correlated with their usefulness in downstream tasks.

Abstract

Reward models are used throughout the post-training of language models to capture nuanced signals from preference data and provide a training target for optimization across instruction following, reasoning, safety, and other domains. The community has begun establishing best practices for evaluating reward models, from benchmarks that test capabilities in specific skill areas to others that test agreement with human preferences. At the same time, progress in evaluation has not been mirrored by the effectiveness of reward models in downstream tasks -- simpler direct alignment algorithms are reported to work better in many cases. This paper introduces RewardBench 2, a new multi-skill reward modeling benchmark designed to bring new, challenging data for accuracy-based reward model evaluation -- models score about 20 points lower on average on RewardBench 2 than on the first RewardBench -- while being highly correlated with downstream performance. Unlike most other benchmarks, RewardBench 2 sources new human prompts instead of existing prompts from downstream evaluations, facilitating more rigorous evaluation practices. In this paper, we describe our benchmark construction process and report how existing models perform on it, while quantifying how performance on the benchmark correlates with downstream use of the models in both inference-time scaling algorithms, like best-of-N sampling, and RLHF training algorithms, like proximal policy optimization.
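The two uses of a reward model named in the abstract, accuracy-based evaluation and best-of-N sampling, can be illustrated with a minimal sketch. This is not the paper's implementation: `toy_reward` is a hypothetical stand-in for a trained reward model, and the accuracy rule used here (the chosen response must outscore every rejected response) is an assumption made for illustration.

```python
from typing import Callable, List, Sequence, Tuple

# A reward function maps (prompt, response) to a scalar score.
RewardFn = Callable[[str, str], float]


def best_of_n(prompt: str, candidates: Sequence[str], reward_fn: RewardFn) -> str:
    """Best-of-N sampling: return the candidate with the highest reward."""
    if not candidates:
        raise ValueError("candidates must be non-empty")
    return max(candidates, key=lambda c: reward_fn(prompt, c))


def accuracy(items: Sequence[Tuple[str, str, List[str]]], reward_fn: RewardFn) -> float:
    """Fraction of items where the chosen response outscores every rejected one.

    Each item is (prompt, chosen, rejected_responses). This scoring rule is an
    assumption for illustration, not taken from the source.
    """
    if not items:
        raise ValueError("items must be non-empty")
    correct = 0
    for prompt, chosen, rejected in items:
        chosen_score = reward_fn(prompt, chosen)
        if all(chosen_score > reward_fn(prompt, r) for r in rejected):
            correct += 1
    return correct / len(items)


def toy_reward(prompt: str, response: str) -> float:
    """Hypothetical reward: number of distinct words shared with the prompt."""
    prompt_words = set(prompt.lower().split())
    response_words = set(response.lower().split())
    return float(len(prompt_words & response_words))


if __name__ == "__main__":
    prompt = "what is the capital of france"
    candidates = ["paris is the capital of france", "i like cheese", "the capital"]
    print(best_of_n(prompt, candidates, toy_reward))

    items = [
        (prompt, candidates[0], candidates[1:]),
        ("name a prime number", "seven", ["a prime number is a number"]),
    ]
    print(accuracy(items, toy_reward))
```

The second toy item is deliberately one the word-overlap reward gets wrong, which shows how an accuracy score can expose a reward model that favors surface features over correctness.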

Eval Frameworks & Benchmarks | RLHF & Preference Learning

Citation Metrics

Citations: 53

Influential citations: 12

References: 72

Year: 2025

Venue: arXiv.org
