NUS, NTU | May 25, 2026 | arXiv:2605.25629

When In-Distribution Gains Fail: Evaluating Weak-to-Strong Reward Models under Preference Shift

Khoi Le, Tri Cao, Phong Nguyen, Cong-Duy Nguyen, Anh Tuan Luu, Miao Chunyan, See-Kiong Ng, Thong Nguyen

AI Summary

This paper investigates weak-to-strong (W2S) reward model generalization under zero-shot distribution shift, revealing that strong models trained on weak preferences can fail to transfer across preference datasets despite strong in-distribution performance. The authors identify a representational failure mode where weak supervision pulls the strong model towards source-domain features. To address this, they introduce Representation Anchoring (Anchor), a regularization technique that constrains drift from the pre-trained representation space, leading to improved out-of-distribution transfer.

Key Contribution

Weak-to-strong reward models can score well on in-distribution tests yet fail to transfer to other preference datasets, revealing a hidden brittleness in current preference learning approaches.

Abstract

Weak-to-strong (W2S) generalization is a promising framework for scalable oversight, yet existing evaluations often test students under matched train-test distributions. We therefore study W2S preference learning under zero-shot distribution shift and find that strong students trained on weak preference labels can appear successful in-distribution while failing to transfer across preference datasets. We provide evidence for a representational failure mode in which weakly supervised fine-tuning can pull the strong model toward source-domain features instead of maintaining broadly transferable preference representations. To mitigate this, we propose Representation Anchoring (Anchor), a simple yet effective regularizer that constrains excessive drift from the pretrained strong model's representation space during fine-tuning, while still allowing task-relevant adaptation. Across preference domains, datasets, and model families, Anchor consistently improves out-of-distribution transfer while maintaining competitive in-distribution performance. Together, our evaluation protocol, transfer-aware metrics, and method expose hidden brittleness in current W2S reward modeling and provide a practical path toward more robust preference transfer.
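To make the idea of constraining representation drift concrete, the following is a minimal sketch, not the authors' implementation. It assumes a standard Bradley-Terry pairwise preference loss and a squared-L2 penalty between the fine-tuned model's features and the frozen pretrained model's features; the abstract does not specify the exact form of the regularizer, so the penalty, the weight `lam`, and all function names here are hypothetical.

```python
import numpy as np

def bradley_terry_loss(r_chosen, r_rejected):
    """Mean negative log-likelihood that the chosen response outranks the rejected one."""
    margin = r_chosen - r_rejected
    # -log(sigmoid(margin)) equals log(1 + exp(-margin)); logaddexp keeps it stable.
    return float(np.mean(np.logaddexp(0.0, -margin)))

def anchor_penalty(h_student, h_reference):
    """Assumed drift measure: mean squared L2 distance between the fine-tuned
    features and the frozen pretrained features for the same inputs."""
    return float(np.mean(np.sum((h_student - h_reference) ** 2, axis=-1)))

def anchored_loss(r_chosen, r_rejected, h_student, h_reference, lam=0.1):
    """Preference loss plus a weighted drift penalty (lam is a hypothetical knob)."""
    return bradley_terry_loss(r_chosen, r_rejected) + lam * anchor_penalty(
        h_student, h_reference
    )

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    r_chosen = rng.normal(size=8)          # rewards for weak-label "chosen" responses
    r_rejected = rng.normal(size=8)        # rewards for weak-label "rejected" responses
    h_reference = rng.normal(size=(8, 16)) # frozen pretrained features
    h_student = h_reference + 0.05 * rng.normal(size=(8, 16))  # slightly drifted features
    print(anchored_loss(r_chosen, r_rejected, h_student, h_reference))
```

With `lam = 0` this reduces to ordinary preference fine-tuning on the weak labels; larger values keep the student's features closer to the pretrained ones, which is the trade-off between task adaptation and drift that the abstract describes.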

Topics: Eval Frameworks & Benchmarks; RLHF & Preference Learning; Scalable Oversight & Alignment Theory

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
