Stuttgart | May 6, 2026 | arXiv:2605.05003

Misaligned by Reward: Socially Undesirable Preferences in LLMs

AI Summary

The authors investigate whether reward models used in LLM alignment capture socially desirable preferences across bias, safety, morality, and ethical reasoning domains. They convert social evaluation datasets into pairwise preference data to test if reward models prefer socially undesirable responses and produce biased output distributions. Results show that existing reward models often prefer socially undesirable options, produce systematically biased distributions, and exhibit a trade-off between bias avoidance and contextual faithfulness.

Key Contribution

Current reward models often *prefer* socially undesirable responses, revealing a critical gap in LLM alignment beyond instruction following.

Abstract

Reward models are a key component of large language model alignment, serving as proxies for human preferences during training. However, existing evaluations focus primarily on broad instruction-following benchmarks, providing limited insight into whether these models capture socially desirable preferences. As a result, important failures in social alignment can remain hidden. We extend reward-model benchmarking to four socially consequential domains: bias, safety, morality, and ethical reasoning. We introduce a framework that converts social evaluation datasets into pairwise preference data, leveraging gold labels where available and directional bias indicators otherwise. This enables us to test whether reward models prefer socially undesirable responses, and whether their preferences produce systematically biased distributions over selected outputs. Across five publicly available reward models and two instruction-tuned models used as reward proxies, we find substantial variation across domains, with no single model performing best overall. The models fall well short of strong social intelligence: they often prefer socially undesirable options, and their preferences produce systematically biased distributions. Moreover, stronger bias avoidance can reduce sensitivity to context, revealing a key alignment trade-off between avoiding biased outcomes and preserving contextual faithfulness. These findings show that standard reward benchmarks are insufficient for assessing social alignment and highlight the need for evaluations that directly measure the social preferences encoded in reward models.
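The conversion to pairwise preference data and the resulting preference test can be illustrated with a minimal sketch. It covers only the gold-label case, not the directional bias indicators. The item schema (`prompt`, `options`, `gold`), the function names, and the `score` callable standing in for a reward model are all hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class PreferencePair:
    prompt: str
    desirable: str    # socially desirable response (here: the gold option)
    undesirable: str  # socially undesirable response (a non-gold option)


def to_pairs(items: Iterable[dict]) -> List[PreferencePair]:
    """Turn multiple-choice items with a gold index into pairwise comparisons.

    Assumed (hypothetical) schema: {"prompt": str, "options": [str], "gold": int}.
    """
    pairs = []
    for item in items:
        gold = item["options"][item["gold"]]
        for i, option in enumerate(item["options"]):
            if i != item["gold"]:
                pairs.append(PreferencePair(item["prompt"], gold, option))
    return pairs


def undesirable_preference_rate(
    pairs: Iterable[PreferencePair],
    score: Callable[[str, str], float],
) -> float:
    """Fraction of pairs where the scorer rates the undesirable response strictly higher."""
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no pairs to evaluate")
    wrong = sum(
        score(p.prompt, p.undesirable) > score(p.prompt, p.desirable)
        for p in pairs
    )
    return wrong / len(pairs)
```

In practice `score(prompt, response)` would wrap a call to a reward model; any callable returning a scalar works for the sketch. A rate near 0 means the scorer rarely prefers the undesirable option, while a rate near 1 means it usually does.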

Constitutional AI & AI Ethics | Eval Frameworks & Benchmarks | RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
