Masashi Sugiyama

RIKEN AIP, Japan

Abstract

Optimizing policies based on human preferences is the key to aligning language models with human intent. This work focuses on reward modeling, a core component in reinforcement learning from human feedback (RLHF), and offline preference optimization, such as direct preference optimization. Conventional approaches typically assume accurate annotations. However, real-world preference data often contains noise due to human errors or biases. We propose a principled framework for robust policy optimization under noisy preferences based on the view of reward modeling as a classification problem. This viewpoint allows us to leverage symmetric losses, well known for their robustness to label noise in classification, for reward modeling, which leads to our Symmetric Preference Optimization (SymPO) method, a novel offline preference optimization algorithm. Theoretically, we prove that symmetric losses enable successful policy optimization even with noisy labels, as the resulting reward is rank-preserving, a property we identify as sufficient for policy improvement. Empirical evaluations on a synthetic dataset and real-world language model alignment tasks demonstrate the efficacy of SymPO. The code is available at https://github.com/nissymori/SymPO.

1 Introduction

Policy optimization with human preferences aims to train a policy that aligns with human desires, given pairs of actions $(a_1, a_2)$ and annotations indicating which action is preferred ($a_1 \succ a_2$ or $a_1 \prec a_2$) (Ouyang et al., 2022; Stiennon et al., 2020; Rafailov et al., 2024).
This paradigm has become increasingly prominent in language model alignment, where the goal is to develop models that behave according to human values, preferences, and instructions.

A central component of this process is reward modeling, which involves learning an underlying reward function from preference data. Reward modeling is the foundation for two major approaches to policy optimization using preference data: Reinforcement Learning from Human Feedback (RLHF) (Ouyang et al., 2022) and offline preference optimization (Tang et al., 2024). In RLHF, a reward model is first trained and then used to fine-tune the policy via on-policy RL methods. In contrast, offline preference optimization directly learns the policy from the collected preference data based on a reward modeling objective, as exemplified by Direct Preference Optimization (DPO) (Rafailov et al., 2024).

Most existing methods assume that preference labels are accurate. However, real-world preference data often suffer from noise due to annotation errors or systematic biases (Gao et al., 2024). This issue has garnered particular attention in offline preference optimization (Gao et al., 2024; Chowdhury et al., 2024; Liang et al., 2024; Wu et al., 2024; Fisch et al., 2024). Existing robust methods either require prior knowledge of the noise rate (Chowdhury et al., 2024) or entail additional hyperparameters (Liang et al., 2024; Wu et al., 2024) to be tuned. Furthermore, they assume symmetric noise (Van Rooyen et al., 2015), where preferences flip with equal probability for $a_1 \succ a_2$ and $a_1 \prec a_2$. However, preference noise is often asymmetric in practice.
For instance, an annotator may systematically favor responses from GPT-4 (Achiam et al., 2023) over those from GPT-3 (Brown et al., 2020) regardless of quality, introducing one-sided bias.

In this paper, we propose a general framework for learning reward models from noisy preferences, grounded in the perspective of binary classification. Viewing reward modeling as risk minimization in binary classification provides both methodological and theoretical benefits.

Figure 1: The symmetric losses are plotted with solid lines, while the 0-1 loss and convex losses are plotted with dashed lines.

Methodologically, this classification viewpoint provides a principled way to handle asymmetric noise and enables the development of a robust objective function for reward modeling. Specifically, it makes an inherent structure in preference data explicit: swapping the two actions in an input pair flips the label. Leveraging this, we show that asymmetric and symmetric noise are equivalent in reward modeling (Sec. 3.1). For robust reward modeling, we then employ symmetric losses $\ell:\mathbb{R}\rightarrow\mathbb{R}$ that satisfy the so-called symmetric condition: $\ell(z)+\ell(-z)=K$ for some constant $K$, as shown in Fig. 1 (Sec. 3.2). Symmetric losses are proven to be robust against symmetric label noise in binary classification settings (Van Rooyen et al., 2015; Charoenphakdee et al., 2019), and we extend this result to show their robustness to general asymmetric noise in reward modeling. This gives rise to our method: Symmetric Preference Optimization (SymPO), a novel offline preference optimization algorithm based on the symmetric loss.

In our theoretical analysis, we uncover a connection between a property of loss functions in binary classification and the success of policy optimization.
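The symmetric condition $\ell(z)+\ell(-z)=K$ can be checked numerically. The sketch below is illustrative only: it assumes the sigmoid loss $\ell(z)=1/(1+e^{z})$ as one standard example of a symmetric loss (with $K=1$) and contrasts it with the convex logistic loss, which does not satisfy the condition. The function names are hypothetical and are not taken from the SymPO code base.

```python
import math


def sigmoid_loss(z: float) -> float:
    """Sigmoid loss l(z) = 1 / (1 + exp(z)); assumed here as an example symmetric loss."""
    return 1.0 / (1.0 + math.exp(z))


def logistic_loss(z: float) -> float:
    """Logistic loss l(z) = log(1 + exp(-z)); a convex, non-symmetric loss."""
    return math.log1p(math.exp(-z))


def symmetric_sums(loss, margins):
    """Return l(z) + l(-z) for every margin z in `margins`."""
    return [loss(z) + loss(-z) for z in margins]


# Hypothetical margins chosen only for illustration.
margins = [-3.0, -1.0, 0.0, 0.5, 2.0]
sigmoid_sums = symmetric_sums(sigmoid_loss, margins)
logistic_sums = symmetric_sums(logistic_loss, margins)

print("sigmoid  l(z) + l(-z):", sigmoid_sums)
print("logistic l(z) + l(-z):", logistic_sums)
```

For the sigmoid loss the sum stays at the constant $K=1$ for every margin, whereas for the logistic loss it changes with $z$, so the logistic loss is not symmetric in this sense.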
We first pinpoint a sufficient condition on the reward for policy improvement in policy optimization: rank preservation, meaning that the ordering of actions under the learned reward matches their ordering under the true underlying reward function (Sec. 4.1). Next, we prove that minimizing the risk with classification-calibrated losses (Bartlett et al., 2006) leads to rank preservation, showing that a broad class of loss functions, including symmetric ones, is appropriate for policy optimization. By combining the robustness of symmetric losses with this policy improvement guarantee, we provide theoretical justification for our method, SymPO (Sec. 4.2).

Finally, we validate our approach through two experiments: a synthetic setup based on MNIST and language model alignment tasks. In the MNIST setting, we evaluate the robustness of symmetric losses for reward modeling. The language model alignment experiment shows that SymPO facilitates robust policy improvement with noisy preference data.

Related Work. Research on noise-robust preference optimization has been extensive. Chowdhury et al. (2024) proposed Robust DPO (rDPO), which adapts noisy label learning techniques (Natarajan et al., 2013) used in statistical machine learning to construct an unbiased loss from noisy preference data, given prior knowledge of the noise rate. Wu et al. (2024) categorized noise in preference data as pointwise and pairwise and proposed Distributionally Robust DPO (DrDPO), which applies distributionally robust optimization (Duchi & Namkoong, 2021) to address these types of noise. Liang et al. (2024) introduced Robust Preference Optimization (ROPO), which combines noisy sample filtering, rejection sampling, and sigmoid-based regularization processes. Unlike these works, we address general asymmetric noise and prove its equivalence to symmetric noise in risk minimization. Furthermore, our method, SymPO, achieves robustness with theoretical guarantees for policy improvement while maintaining methodological simplicity.
While prior work (Tang et al., 2024) also framed reward modeling as a binary classification task, it focused on convex losses, such as logistic, exponential, and squared losses (see Fig. 1). In contrast, we explore non-convex symmetric losses for their robustness to noisy labels. Please refer to App. A for a complete list of related works.

Contributions. The main contributions of this paper are as follows:
1) We leverage the binary classification perspective of reward modeling to prove the equivalence between asymmetric and symmetric noise, and we propose a new offline preference optimization method, SymPO, that exploits the robustness of symmetric losses.
2) We bridge the gap between classification theory and policy optimization by proving that classification-calibrated losses yield policy improvement, thereby supporting the use of symmetric losses in policy optimization.
3) We validate the robustness of symmetric losses in reward modeling and policy optimization through experiments.

2 Preliminaries

Sec. 2.1 establishes reward modeling as the grounding concept for policy optimization from human preferences. We explain the conventional way to learn a reward function from human preferences based on the Bradley-Terry (BT) model, and demonstrate how it supports two major policy optimization paradigms: RLHF and offline preference optimization. Sec. 2.2 then casts reward modeling as binary classification, preparing our approach to deal with noisy preferences.
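To make the distinction between symmetric and asymmetric preference noise concrete, the following sketch flips preference labels with class-dependent rates. The encoding ($+1$ for $a_1 \succ a_2$, $-1$ for $a_1 \prec a_2$), the function name `flip_labels`, and the parameter names `flip_pos` and `flip_neg` are assumptions made for illustration and do not come from the paper. Symmetric noise corresponds to `flip_pos == flip_neg`, and one-sided bias corresponds to one of the two rates being zero.

```python
import random


def flip_labels(labels, flip_pos, flip_neg, seed=0):
    """Flip each preference label with a class-dependent probability.

    labels   : list of +1 (a1 preferred) or -1 (a2 preferred); hypothetical encoding.
    flip_pos : probability of flipping a +1 label to -1.
    flip_neg : probability of flipping a -1 label to +1.
    """
    rng = random.Random(seed)  # local generator so results are reproducible
    noisy = []
    for y in labels:
        rate = flip_pos if y == 1 else flip_neg
        noisy.append(-y if rng.random() < rate else y)
    return noisy


# Hypothetical clean labels for six preference pairs.
clean = [1, -1, 1, 1, -1, -1]

symmetric = flip_labels(clean, flip_pos=0.2, flip_neg=0.2)   # equal flip rates
one_sided = flip_labels(clean, flip_pos=1.0, flip_neg=0.0)   # extreme one-sided bias

print("clean    :", clean)
print("symmetric:", symmetric)
print("one-sided:", one_sided)
```

In the extreme one-sided setting every $+1$ label is flipped while every $-1$ label is left untouched, which mimics an annotator who always favors the same side of the pair regardless of quality.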
2.1 Reward Modeling and Policy Optimization

In reward modeling with the Bradley-Terry (BT) model (Bradley & Terry, 1952), we assume that, for a pair of actions $(a_1, a_2) \in \mathcal{A} \times \mathcal{A}$ ($\mathcal{A} \subset \mathbb{R}^d$ and $\lvert \mathcal{A} \rvert < \infty$), the probability of the preference $a_1 \succ a_2$ is given by

$$p(a_1 \succ a_2) = \sigma\bigl(r_{\mathrm{true}}(a_1) - r_{\mathrm{true}}(a_2)\bigr), \tag{1}$$

where $r_{\mathrm{true}}: \mathcal{A} \to \mathbb{R}$ is an underlying reward function and $\sigma$ is the sigmoid function.

Reward Modeling.
Based on the BT model, given preference data in the form of $\mathcal{D} = \{(a_1^i \succ a_2^i)\}_{i=1}^{n}$, we train the reward model $r: \mathcal{A} \to \mathbb{R}$ by minimizing the following objective:

$$\mathcal{L}(r) = -\mathbb{E}_{(a_1, a_2) \sim \mathcal{D}}\bigl[\log \sigma\bigl(r(a_1) - r(a_2)\bigr)\bigr], \tag{2}$$

where $\mathbb{E}_{(a_1, a_2) \sim \mathcal{D}}[\cdot]$ is the empirical average over the preference dataset $\mathcal{D}$. As explained subsequently, reward modeling plays a central role in two primary policy optimization approaches.

Reinforcement Learning from Human Feedback (RLHF). In RLHF (Ouyang et al., 2022), we first train a reward function via Eq. (2).
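As a concrete illustration of Eqs. (1) and (2), the sketch below evaluates the BT preference probability and the empirical reward modeling objective on a toy dataset. It assumes a tabular reward stored in a dictionary and a three-action set; the action names, reward values, and function names are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic sigmoid sigma(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))


def bt_preference_prob(r1: float, r2: float) -> float:
    """Eq. (1): probability that the action with reward r1 is preferred over r2."""
    return sigmoid(r1 - r2)


def bt_loss(reward: dict, dataset: list) -> float:
    """Eq. (2): empirical average of -log sigma(r(a1) - r(a2)).

    Each element of `dataset` is a pair (a1, a2) with a1 annotated as preferred.
    """
    total = 0.0
    for preferred, rejected in dataset:
        total += -math.log(sigmoid(reward[preferred] - reward[rejected]))
    return total / len(dataset)


# Hypothetical toy data: action "a" beats "b", "b" beats "c", and "a" beats "c".
dataset = [("a", "b"), ("b", "c"), ("a", "c")]

reward_consistent = {"a": 2.0, "b": 1.0, "c": 0.0}  # agrees with every annotation
reward_reversed = {"a": 0.0, "b": 1.0, "c": 2.0}    # contradicts every annotation
reward_flat = {"a": 0.0, "b": 0.0, "c": 0.0}        # uninformative reward

loss_consistent = bt_loss(reward_consistent, dataset)
loss_reversed = bt_loss(reward_reversed, dataset)
loss_flat = bt_loss(reward_flat, dataset)

print("consistent:", loss_consistent)
print("reversed  :", loss_reversed)
print("flat      :", loss_flat)
```

A reward that ranks the actions consistently with the annotations attains a lower objective value than one that reverses them, while a constant reward gives $\log 2$ per pair because every preference probability equals $1/2$.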
Then, we optimize a policy $\pi: \mathcal{A} \to [0, 1]$ based on the Kullback-Leibler (KL) regularized reward maximization problem as,

The University of Tokyo, Japan
