This study investigates the internal dynamics of reward models in reinforcement learning from human feedback (RLHF), focusing on the tension between helpfulness and harmlessness objectives. Comparing models trained under helpfulness-only, harmlessness-only, and mixed-objective settings, the authors find that mixed-objective models often underperform their single-objective counterparts, which they read as evidence of interference between the two goals. Activation-based analysis and targeted ablations show that neurons associated with each objective support that objective while often hurting the other, and that many neurons are shared between the two and disproportionately influence model behavior, highlighting the complexities of multi-objective alignment in AI systems.
Mixed-objective reward models often underperform single-objective ones, and neurons shared between helpfulness and harmlessness contribute disproportionately to the resulting alignment tension.
Reward models are a key component of reinforcement learning from human feedback (RLHF), steering language models toward both helpful and harmless behaviour. However, the internal mechanisms underlying these objectives, and the conflicts between them, remain poorly understood. We study alignment tension in reward models trained under helpfulness-only, harmlessness-only, and mixed-objective settings. Mixed-objective models often underperform single-objective models, indicating interference between objectives. Using activation-based methods, we identify neurons associated with each objective and study their functional roles via targeted ablations. These neurons causally support their corresponding objective while often negatively affecting the opposing one. A substantial proportion of neurons are shared between helpfulness and harmlessness, and these shared neurons exert a disproportionate influence on model behaviour, contributing to alignment tension. Our results offer a mechanistic interpretation of how alignment objectives are represented in reward models and why multi-objective alignment remains challenging, motivating future work on disentangled and controllable alignment methods.
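The identify-then-ablate procedure described in the abstract can be illustrated with a toy sketch. The abstract does not give the scoring rule, the architecture, or the ablation scheme, so everything here is an assumption made for illustration: a one-hidden-layer ReLU reward model, neurons ranked by the gap in mean activation between chosen and rejected responses, zero-ablation, and random data standing in for preference pairs. All function names are hypothetical.

```python
import numpy as np

def hidden(x, W1):
    """Hidden activations of a toy one-layer ReLU reward model (assumed architecture)."""
    return np.maximum(x @ W1, 0.0)

def top_k_neurons(chosen_acts, rejected_acts, k):
    """Pick the k neurons with the largest |mean(chosen) - mean(rejected)| activation gap.

    This scoring rule is an assumption, not the paper's stated method.
    """
    gap = np.abs(chosen_acts.mean(axis=0) - rejected_acts.mean(axis=0))
    return set(np.argsort(gap)[-k:].tolist())

def reward(x, W1, w2, ablate=()):
    """Scalar reward per input, with the listed hidden neurons zeroed out."""
    h = hidden(x, W1)
    idx = np.array(sorted(ablate), dtype=int)
    h[:, idx] = 0.0  # zero-ablation of the selected neurons
    return h @ w2

def preference_accuracy(chosen, rejected, W1, w2, ablate=()):
    """Fraction of pairs where the chosen response receives the higher reward."""
    r_chosen = reward(chosen, W1, w2, ablate)
    r_rejected = reward(rejected, W1, w2, ablate)
    return float(np.mean(r_chosen > r_rejected))

# Toy usage: random weights and inputs stand in for a trained model and real pairs.
rng = np.random.default_rng(0)
W1, w2 = rng.normal(size=(8, 16)), rng.normal(size=16)
help_chosen, help_rejected = rng.normal(size=(32, 8)), rng.normal(size=(32, 8))
harm_chosen, harm_rejected = rng.normal(size=(32, 8)), rng.normal(size=(32, 8))

help_neurons = top_k_neurons(hidden(help_chosen, W1), hidden(help_rejected, W1), k=4)
harm_neurons = top_k_neurons(hidden(harm_chosen, W1), hidden(harm_rejected, W1), k=4)
shared = help_neurons & harm_neurons  # neurons selected for both objectives

for name, group in [("none", set()), ("helpfulness", help_neurons),
                    ("harmlessness", harm_neurons), ("shared", shared)]:
    acc_help = preference_accuracy(help_chosen, help_rejected, W1, w2, ablate=group)
    acc_harm = preference_accuracy(harm_chosen, harm_rejected, W1, w2, ablate=group)
    print(f"ablate {name}: helpfulness acc={acc_help:.2f}, harmlessness acc={acc_harm:.2f}")
```

With a real reward model, the activations would come from a chosen layer of the network rather than a toy MLP, and the comparison of interest is how each ablation shifts accuracy on the matching objective versus the opposing one.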