This paper investigates the instability of non-linear value decomposition in offline multi-agent reinforcement learning (MARL), identifying value-scale amplification and unstable optimization as key issues. To address these issues, the authors introduce scale-invariant value normalization (SVN), a technique that stabilizes actor-critic training while preserving the Bellman fixed point. Empirical results demonstrate that SVN, combined with careful choices for value decomposition, value learning, and policy extraction, enables stable and effective offline MARL.
A simple value normalization technique unlocks the potential of offline multi-agent RL by stabilizing non-linear value decomposition, a notoriously unstable component.
Despite remarkable achievements in single-agent offline reinforcement learning (RL), multi-agent RL (MARL) has struggled to adopt this paradigm, largely persisting with on-policy training and self-play from scratch. One reason for this gap is the instability of non-linear value decomposition, which has led prior works to avoid complex mixing networks in favor of linear value decomposition (e.g., VDN) with value regularization used in single-agent setups. In this work, we analyze the source of instability in non-linear value decomposition within the offline MARL setting. Our observations confirm that non-linear value decomposition induces value-scale amplification and unstable optimization. To alleviate this, we propose a simple technique, scale-invariant value normalization (SVN), that stabilizes actor-critic training without altering the Bellman fixed point. Empirically, we examine the interaction among key components of offline MARL (e.g., value decomposition, value learning, and policy extraction) and derive a practical recipe that unlocks its full potential.
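To make the distinction between linear and non-linear value decomposition concrete, the sketch below contrasts a VDN-style sum with a small QMIX-style monotonic mixer. It is only an illustration of the two families named in the abstract, not the paper's analysis and not SVN. The function names, the single hidden layer, and the fixed weight lists are assumptions made for this example; in QMIX the mixer weights are produced by state-conditioned hypernetworks.

```python
import math


def vdn_mix(agent_qs):
    """Linear decomposition (VDN): Q_tot is the plain sum of per-agent values."""
    return sum(agent_qs)


def monotonic_mix(agent_qs, w1, b1, w2, b2):
    """Non-linear, QMIX-style mixer with one hidden layer (illustrative only).

    Taking abs() of the weights keeps dQ_tot/dQ_i >= 0, so the joint value is
    monotonic in every agent's value. The weights are fixed lists here, which
    is an assumption of this sketch.
    """
    hidden = []
    for j in range(len(b1)):
        pre = sum(abs(w1[i][j]) * q for i, q in enumerate(agent_qs)) + b1[j]
        # ELU activation, as commonly used in monotonic mixers.
        hidden.append(pre if pre > 0 else math.exp(pre) - 1.0)
    return sum(abs(w2[j]) * h for j, h in enumerate(hidden)) + b2


if __name__ == "__main__":
    qs = [1.0, 2.0]                  # per-agent values for a chosen joint action
    w1 = [[3.0, 3.0], [3.0, 3.0]]    # hypothetical weights: 2 agents x 2 hidden units
    b1 = [0.0, 0.0]
    w2 = [3.0, 3.0]
    b2 = 0.0
    print("VDN Q_tot:  ", vdn_mix(qs))
    print("Mixer Q_tot:", monotonic_mix(qs, w1, b1, w2, b2))
```

With the sum, the joint value always lives on the same scale as the per-agent values. With the mixer, the output scale depends on the magnitude of the mixing weights, so the same per-agent values can map to a much larger joint value. This toy example only shows that such a dependence exists; how it plays out during offline training is the subject of the paper's analysis.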