This paper investigates data poisoning attacks in learning from human preferences, specifically focusing on teaching a target policy via synthesized preference data. It analyzes the sample complexity required for successful attacks against two paradigms: Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). The main results provide theoretical lower and upper bounds on the number of poisoned samples needed to enforce a target policy, highlighting the susceptibility of these methods to such attacks.
RLHF and DPO are surprisingly vulnerable to data poisoning, with even a small number of carefully crafted preferences capable of steering the learned policy towards a desired (potentially harmful) target.
We study data poisoning attacks in learning from human preferences. More specifically, we consider the problem of teaching or enforcing a target policy $\pi^\dagger$ by synthesizing preference data. We seek to understand the susceptibility of different preference-based learning paradigms to poisoned preference data by analyzing the number of samples the attacker requires to enforce $\pi^\dagger$. We first propose a general data poisoning formulation in learning from human preferences and then study it for two popular paradigms, namely: (a) reinforcement learning from human feedback (RLHF), which operates by learning a reward model from preferences; and (b) direct preference optimization (DPO), which directly optimizes the policy using preferences. We conduct a theoretical analysis of the effectiveness of data poisoning in a setting where the attacker is allowed to augment a pre-existing dataset, and we also study the special case where the attacker can synthesize the entire preference dataset from scratch. As our main results, we provide lower and upper bounds on the number of samples required to enforce $\pi^\dagger$. Finally, we discuss the implications of our results for the susceptibility of these learning paradigms to such data poisoning attacks.
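To make the attack surface concrete, here is a minimal toy sketch (not the paper's construction) of how synthesized preferences can steer a DPO-style learner. It assumes a tabular softmax policy over a handful of actions and a uniform reference policy, so the DPO log-ratio term reduces to $\log \pi(y_w) - \log \pi(y_l)$; the attacker simply emits pairs in which the target action always wins.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dpo_step(theta, winner, loser, beta=1.0, lr=0.5):
    """One gradient step of the DPO loss on a single preference pair.

    With a uniform reference policy, the loss on the pair is
    -log sigmoid(beta * (log pi(winner) - log pi(loser))).
    """
    pi = softmax(theta)
    delta = math.log(pi[winner]) - math.log(pi[loser])
    s = 1.0 / (1.0 + math.exp(-beta * delta))  # sigmoid(beta * delta)
    coef = beta * (1.0 - s)                    # -dLoss/dDelta
    new_theta = []
    for j in range(len(theta)):
        # d(log pi(a))/d(theta_j) = 1[a == j] - pi_j for a softmax policy
        d_log_w = (1.0 if j == winner else 0.0) - pi[j]
        d_log_l = (1.0 if j == loser else 0.0) - pi[j]
        new_theta.append(theta[j] + lr * coef * (d_log_w - d_log_l))
    return new_theta

# Attacker synthesizes a poisoned dataset in which the target action 0
# is always preferred over every alternative.
theta = [0.0, 0.0, 0.0]
for _ in range(50):
    theta = dpo_step(theta, winner=0, loser=1)
    theta = dpo_step(theta, winner=0, loser=2)

print(softmax(theta))  # probability mass concentrates on the target action
```

The sketch illustrates the qualitative point of the abstract: a dataset of consistently oriented pairs is sufficient to concentrate the learned policy on $\pi^\dagger$; the paper's contribution is to bound how many such samples are needed, including when the poisoned pairs must compete with a pre-existing clean dataset.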