Mar 30, 2026 | arXiv:2603.28281

Corruption-robust Offline Multi-agent Reinforcement Learning From Human Feedback

Andi Nika, Debmalya Mandal, Parameswaran Kamalaruban, Adish Singla, Goran Radanović

AI Summary

This paper studies offline multi-agent reinforcement learning from human feedback (MARLHF) when an ε-fraction of the training data is adversarially corrupted. The authors develop robust estimators that bound the Nash equilibrium gap by O(ε^{1 - o(1)}) under uniform coverage and by O(√ε) under unilateral coverage. Because both procedures are computationally intractable, they also propose a quasi-polynomial-time algorithm that finds a coarse correlated equilibrium (CCE) whose gap scales as O(√ε) under unilateral coverage.

Key Contribution

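Even when an ε-fraction of the human feedback is adversarially corrupted, offline multi-agent reinforcement learning admits equilibrium-gap guarantees that degrade gracefully with ε: O(ε^{1 - o(1)}) under uniform coverage and O(√ε) under unilateral coverage.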

Abstract

We consider robustness against data corruption in offline multi-agent reinforcement learning from human feedback (MARLHF) under a strong-contamination model: given a dataset $D$ of trajectory-preference tuples (each preference being an $n$-dimensional binary label vector representing each of the $n$ agents' preferences), an $ε$-fraction of the samples may be arbitrarily corrupted. We model the problem using the framework of linear Markov games. First, under a uniform coverage assumption, where every policy of interest is sufficiently represented in the clean (prior to corruption) data, we introduce a robust estimator that guarantees an $O(ε^{1 - o(1)})$ bound on the Nash equilibrium gap. Next, we move to the more challenging unilateral coverage setting, in which only a Nash equilibrium and its single-player deviations are covered. In this case, our proposed algorithm achieves an $O(\sqrt{ε})$ bound on the Nash gap. Both of these procedures, however, suffer from intractable computation. To address this, we relax our solution concept to coarse correlated equilibria (CCE). Under the same unilateral coverage regime, we derive a quasi-polynomial-time algorithm whose CCE gap scales as $O(\sqrt{ε})$. To the best of our knowledge, this is the first systematic treatment of adversarial data corruption in offline MARLHF.
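To make the data model in the abstract concrete, here is a minimal Python sketch of an ε-contaminated preference dataset. It is an illustration under stated assumptions, not the paper's construction: the function names `sample_clean_dataset` and `corrupt_dataset`, the logistic (Bradley-Terry-style) label model, the Gaussian feature differences, and the label-flipping attack are all hypothetical choices. In particular, a strong-contamination adversary may inspect the data and replace samples arbitrarily, whereas this sketch picks the corrupted samples at random and only flips their labels.

```python
import math
import random


def sample_clean_dataset(num_samples, num_agents, feature_dim, seed=0):
    """Draw a hypothetical clean dataset of (feature_diff, labels) tuples.

    feature_diff stands in for the feature difference between two
    trajectories; labels is an n-dimensional binary vector with one
    preference bit per agent. The logistic label model is an assumption.
    """
    rng = random.Random(seed)
    # Hypothetical per-agent reward parameters (one vector per agent).
    theta = [[rng.gauss(0, 1) for _ in range(feature_dim)]
             for _ in range(num_agents)]
    dataset = []
    for _ in range(num_samples):
        diff = [rng.gauss(0, 1) for _ in range(feature_dim)]
        labels = []
        for agent_theta in theta:
            logit = sum(w * x for w, x in zip(agent_theta, diff))
            prob = 1.0 / (1.0 + math.exp(-logit))
            labels.append(1 if rng.random() < prob else 0)
        dataset.append((diff, labels))
    return dataset


def corrupt_dataset(dataset, eps, seed=1):
    """Corrupt floor(eps * N) samples by flipping every agent's label.

    This is one simple attack within the eps budget; a real adversary
    could choose the samples and their replacements adaptively.
    """
    rng = random.Random(seed)
    budget = math.floor(eps * len(dataset))
    corrupted_idx = set(rng.sample(range(len(dataset)), budget))
    corrupted = []
    for i, (diff, labels) in enumerate(dataset):
        if i in corrupted_idx:
            labels = [1 - b for b in labels]
        corrupted.append((list(diff), list(labels)))
    return corrupted, corrupted_idx


clean = sample_clean_dataset(num_samples=1000, num_agents=3, feature_dim=4)
noisy, idx = corrupt_dataset(clean, eps=0.1)
num_changed = sum(c[1] != n[1] for c, n in zip(clean, noisy))
print(f"corruption budget: {len(idx)}, samples changed: {num_changed}")
```

A robust estimator in this setting only sees `noisy` and must tolerate any choice of the corrupted subset, which is why the abstract's guarantees are stated in terms of ε rather than a specific attack.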

Red-Teaming & Adversarial Robustness | RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
