The paper introduces Group Ordinal Policy Optimization (GOPO), a policy optimization method designed for reinforcement learning from human feedback (RLHF) that addresses the misalignment between reward models trained on relative preferences and policy optimization techniques that rely on absolute reward magnitudes. GOPO transforms rewards into a rank-based representation, discarding magnitude information and focusing solely on the ordinal relationships between rewards. Empirical results demonstrate that GOPO achieves higher reward trajectories, improved LLM-as-judge evaluations, and faster convergence compared to Group Relative Policy Optimization (GRPO) across various tasks and model sizes.
Ditching reward magnitudes for rankings unlocks faster and better RLHF, especially when judging quality is subjective.
Standard reinforcement learning from human feedback (RLHF) trains a reward model on pairwise preference data and then uses it for policy optimization. However, while reward models are optimized to capture relative preferences, existing policy optimization techniques rely on absolute reward magnitudes during training. In settings where the rewards are non-verifiable, such as summarization, instruction following, and chat completion, this misalignment often leads to suboptimal performance. We introduce Group Ordinal Policy Optimization (GOPO), a policy optimization method that uses only the ranking of the rewards and discards their magnitudes. Our rank-based transformation of rewards provides several gains over Group Relative Policy Optimization (GRPO) in settings with non-verifiable rewards: (1) consistently higher training/validation reward trajectories, (2) improved LLM-as-judge evaluations across most intermediate training steps, and (3) reaching a policy of comparable quality in substantially fewer training steps than GRPO. We demonstrate consistent improvements across a range of tasks and model sizes.
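The abstract contrasts GRPO's magnitude-based, within-group normalization with GOPO's purely ordinal transform. A minimal Python sketch of that distinction is below; note the specific centered-rank mapping and the positional tie-breaking are illustrative assumptions, not necessarily the paper's exact formulation:

```python
def grpo_advantages(rewards):
    """GRPO-style: z-score rewards within a group (depends on magnitudes)."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5 or 1.0  # 0.0 is falsy, so a constant group maps to std=1
    return [(r - mean) / std for r in rewards]


def rank_advantages(rewards):
    """Ordinal sketch: keep only the ranking, discard magnitudes.

    Assumption: ranks are mapped to zero-mean advantages in [-1, 1];
    ties are broken by position for simplicity.
    """
    n = len(rewards)
    order = sorted(range(n), key=lambda i: rewards[i])  # worst -> best
    ranks = [0] * n
    for rank, idx in enumerate(order):
        ranks[idx] = rank
    return [2 * r / (n - 1) - 1 for r in ranks]
```

The key property the paper's motivation points to: an outlier reward (say a single very large value) dominates the z-scored advantages, whereas under the rank transform the same group yields identical advantages no matter how extreme the top reward's magnitude is.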