UBCVectorMay 24, 2026arXiv:2605.25189

Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

Wenlong Deng, Jiaji Huang, Kaan Ozkara, Yushu Li, Christos Thrampoulidis, Xiaoxiao Li, Youngsuk Park

AI Summary

This paper investigates reward hacking in RL fine-tuning of LMs by analyzing the geometry of parameter updates, finding that hacking episodes exhibit larger directional changes in parameter space compared to intended learning. To mitigate this, they introduce "trusted-direction projection," which constrains gradients during RL to remain within a subspace defined by a clean, non-hacking reference trajectory. Experiments on mathematical reasoning tasks demonstrate that this projection delays reward hacking and improves task performance.

Key Contribution

Reward hacking isn't just about incentives, it's about wild directional swings in your model's parameter space – and constraining those swings can keep your LM on the straight and narrow.

Abstract

Reward hacking arises when a model improves a proxy reward by exploiting shortcuts rather than solving the intended task. We study this failure mode through the geometry of reinforcement learning updates in language models and argue that hacking emerges when optimization drifts away from a stable low-dimensional learning trajectory. We analyze this drift through dominant singular directions of parameter updates and show that reward-hacking runs exhibit substantially larger directional change than clean runs. Motivated by this observation, we introduce trusted-direction projection, which constrains gradients to remain within a clean reference subspace. Across reward-hacking experiments on mathematical reasoning, the proposed approach delays shortcut exploitation and better preserves task performance.

Interpretability & Mechanistic Interp RLHF & Preference Learning Scalable Oversight & Alignment Theory

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

Related Papers