This paper investigates data poisoning attacks in learning from human preferences, specifically focusing on teaching a target policy via synthesized preference data. It analyzes the sample complexity required for successful attacks against two paradigms: Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). The main results provide theoretical lower and upper bounds on the number of poisoned samples needed to enforce a target policy, highlighting the susceptibility of these methods to such attacks.
RLHF and DPO are surprisingly vulnerable to data poisoning, with even a small number of carefully crafted preferences capable of steering the learned policy towards a desired (potentially harmful) target.
We study data poisoning attacks in learning from human preferences. More specifically, we consider the problem of teaching or enforcing a target policy $\pi^\dagger$ by synthesizing preference data. We seek to understand the susceptibility of different preference-based learning paradigms to poisoned preference data by analyzing the number of samples the attacker requires to enforce $\pi^\dagger$. We first propose a general data poisoning formulation in learning from human preferences and then study it for two popular paradigms, namely: (a) reinforcement learning from human feedback (RLHF), which operates by learning a reward model from preferences; and (b) direct preference optimization (DPO), which directly optimizes the policy using preferences. We conduct a theoretical analysis of the effectiveness of data poisoning in a setting where the attacker is allowed to augment a pre-existing dataset, and we also study the special case where the attacker can synthesize the entire preference dataset from scratch. As our main results, we provide lower and upper bounds on the number of samples required to enforce $\pi^\dagger$. Finally, we discuss the implications of our results for the susceptibility of these learning paradigms to such data poisoning attacks.
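To make the attack surface concrete, here is a minimal toy sketch (not the paper's construction) of how synthesized preferences can steer a DPO-style learner. It assumes a tabular softmax policy over a handful of actions and a uniform reference policy, so the DPO log-ratio term reduces to $\log \pi(y_w) - \log \pi(y_l)$; the attacker simply emits pairs in which the target action always wins.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dpo_step(theta, winner, loser, beta=1.0, lr=0.5):
    """One gradient step of the DPO loss on a single preference pair.

    With a uniform reference policy, the loss on the pair is
    -log sigmoid(beta * (log pi(winner) - log pi(loser))).
    """
    pi = softmax(theta)
    delta = math.log(pi[winner]) - math.log(pi[loser])
    s = 1.0 / (1.0 + math.exp(-beta * delta))  # sigmoid(beta * delta)
    coef = beta * (1.0 - s)                    # -dLoss/dDelta
    new_theta = []
    for j in range(len(theta)):
        # d(log pi(a))/d(theta_j) = 1[a == j] - pi_j for a softmax policy
        d_log_w = (1.0 if j == winner else 0.0) - pi[j]
        d_log_l = (1.0 if j == loser else 0.0) - pi[j]
        new_theta.append(theta[j] + lr * coef * (d_log_w - d_log_l))
    return new_theta

# Attacker synthesizes a poisoned dataset in which the target action 0
# is always preferred over every alternative.
theta = [0.0, 0.0, 0.0]
for _ in range(50):
    theta = dpo_step(theta, winner=0, loser=1)
    theta = dpo_step(theta, winner=0, loser=2)

print(softmax(theta))  # probability mass concentrates on the target action
```

The sketch illustrates the qualitative point of the abstract: a dataset of consistently oriented pairs is sufficient to concentrate the learned policy on $\pi^\dagger$; the paper's contribution is to bound how many such samples are needed, including when the poisoned pairs must compete with a pre-existing clean dataset.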