This paper investigates the exploitability of policies learned via multi-agent imitation learning (MA-IL) in n-player Markov Games, demonstrating that exact measure matching can fail to produce low-exploitable policies and proving a new hardness result for characterizing the Nash gap given a fixed measure matching error. To address these challenges, the authors introduce strategic dominance assumptions on the expert equilibrium, deriving a Nash imitation gap of $\mathcal{O}\left(nε_{\text{BC}}/(1-γ)^2\right)$ for dominant strategy expert equilibria with Behavioral Cloning error $ε_{\text{BC}}$. They further generalize this result using a novel notion of best-response continuity, suggesting that standard regularization techniques can implicitly encourage this property.
Even perfect imitation of expert behavior in multi-agent settings can lead to highly exploitable policies, unless experts play dominant strategies or exhibit best-response continuity.
Multi-agent imitation learning (MA-IL) aims to learn optimal policies from expert demonstrations of interactions in multi-agent domains. Despite existing guarantees on the performance of the resulting learned policies, characterizations of how far the learned policies are from a Nash equilibrium are missing for offline MA-IL. In this paper, we demonstrate impossibility and hardness results for learning low-exploitable policies in general $n$-player Markov Games. We do so by providing examples where even exact measure matching fails, and by proving a new hardness result on characterizing the Nash gap given a fixed measure matching error. We then show how these challenges can be overcome using strategic dominance assumptions on the expert equilibrium. Specifically, for dominant strategy expert equilibria and a Behavioral Cloning error $ε_{\text{BC}}$, this yields a Nash imitation gap of $\mathcal{O}\left(nε_{\text{BC}}/(1-γ)^2\right)$ for a discount factor $γ$. We generalize this result with a new notion of best-response continuity, and argue that this is implicitly encouraged by standard regularization techniques.
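To make the notion of a Nash gap concrete, the following is a minimal sketch and not the paper's construction. It assumes a one-shot normal-form game rather than a Markov Game, and it takes the Nash gap to be the largest gain any single player can obtain by unilaterally switching to a best response; the paper's exact definition may differ. The function name `nash_gap`, the Prisoner's Dilemma payoffs, and the error level `eps` are all hypothetical choices made for illustration.

```python
import numpy as np

def nash_gap(payoffs, policies):
    """Largest unilateral best-response gain over all players.

    payoffs:  list of n arrays, each of shape (A_1, ..., A_n); payoffs[i][a]
              is player i's reward under joint action a (illustrative input).
    policies: list of n probability vectors, one independent policy per player.
    """
    n = len(policies)
    gains = []
    for i in range(n):
        u = payoffs[i]
        # Average out every other player's action. Contracting from the last
        # axis to the first keeps the remaining lower axis indices valid.
        for j in reversed(range(n)):
            if j != i:
                u = np.tensordot(u, policies[j], axes=([j], [0]))
        # u now holds player i's expected reward for each of its own actions.
        current = float(u @ policies[i])
        gains.append(float(u.max()) - current)
    return max(gains)

# Prisoner's Dilemma (assumed payoffs): action 0 = cooperate, 1 = defect.
# Defecting is a dominant strategy, so (defect, defect) is the equilibrium.
R1 = np.array([[3.0, 0.0],
               [5.0, 1.0]])
payoffs = [R1, R1.T]

expert = [np.array([0.0, 1.0]), np.array([0.0, 1.0])]
gap_expert = nash_gap(payoffs, expert)

# An imitator that places eps probability on the wrong action for each player.
eps = 0.1
learned = [np.array([eps, 1.0 - eps]), np.array([eps, 1.0 - eps])]
gap_learned = nash_gap(payoffs, learned)
```

In this toy game the expert's gap is zero and the imitator's gap works out to `eps + eps**2`, so a small imitation error leaves only a small incentive to deviate. That mirrors, in a very simplified setting, why a dominant strategy expert equilibrium admits a gap bound that scales with the Behavioral Cloning error; it does not reproduce the horizon dependence $1/(1-γ)^2$ or the failure cases in general Markov Games.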