The paper introduces QSIM, a framework that mitigates Q-value overestimation in multi-agent reinforcement learning (MARL) based on value decomposition (VD) by reconstructing the TD target using action similarity. Rather than bootstrapping from the greedy joint action alone, QSIM computes a similarity-weighted expectation over a structured near-greedy joint action space, which smooths the target with behaviorally related actions. Empirical results show that integrating QSIM with existing VD methods improves performance and stability and significantly reduces systematic value overestimation.
By weighting Q-learning updates based on action similarity, QSIM tames overestimation in multi-agent RL, leading to more stable and effective learning.
Value decomposition (VD) methods have achieved remarkable success in cooperative multi-agent reinforcement learning (MARL). However, their reliance on the max operator for temporal-difference (TD) target calculation leads to systematic Q-value overestimation. This issue is particularly severe in MARL due to the combinatorial explosion of the joint action space, which often results in unstable learning and suboptimal policies. To address this problem, we propose QSIM, a similarity-weighted Q-learning framework that reconstructs the TD target using action similarity. Instead of using the greedy joint action directly, QSIM forms a similarity-weighted expectation over a structured near-greedy joint action space. This formulation allows the target to integrate Q-values from diverse yet behaviorally related actions while assigning greater influence to those that are more similar to the greedy choice. By smoothing the target with structurally relevant alternatives, QSIM mitigates overestimation and improves learning stability. Extensive experiments demonstrate that QSIM can be integrated with various VD methods, consistently yielding better performance and stability than the original algorithms. Furthermore, empirical analysis confirms that QSIM significantly mitigates the systematic value overestimation in MARL. Code is available at https://github.com/MaoMaoLYJ/pymarl-qsim.
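
The idea of replacing the max-based TD target with a similarity-weighted expectation can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the near-greedy joint action space or the similarity measure are defined, so the sketch assumes (1) the near-greedy set contains the greedy joint action plus every joint action that differs from it in exactly one agent's action, (2) similarity weights are a softmax over negative Hamming distance to the greedy joint action, with a hypothetical `temperature` parameter, and (3) a VDN-style additive mixer in which the joint value is the sum of per-agent Q-values. All function names are illustrative.

```python
import numpy as np


def near_greedy_joint_actions(greedy, n_actions):
    """Greedy joint action plus all joint actions that change one agent's action.

    Assumption: this one-agent-deviation neighbourhood stands in for the
    paper's "structured near-greedy joint action space".
    """
    greedy = tuple(int(a) for a in greedy)
    candidates = [greedy]
    for agent, a_star in enumerate(greedy):
        for a in range(n_actions):
            if a != a_star:
                alt = list(greedy)
                alt[agent] = a
                candidates.append(tuple(alt))
    return candidates


def similarity_weighted_target(agent_qs_next, reward, gamma=0.99,
                               temperature=1.0, done=False):
    """Return (similarity-weighted target, standard max-based target).

    agent_qs_next: array of shape (n_agents, n_actions) holding each agent's
    target-network Q-values at the next state.
    """
    agent_qs_next = np.asarray(agent_qs_next, dtype=float)
    n_agents, n_actions = agent_qs_next.shape

    # Under an additive mixer, the greedy joint action is the per-agent argmax.
    greedy = agent_qs_next.argmax(axis=1)
    candidates = near_greedy_joint_actions(greedy, n_actions)

    # Assumed VDN-style mixing: Q_tot is the sum of the chosen per-agent values.
    agents = np.arange(n_agents)
    q_tot = np.array([agent_qs_next[agents, list(c)].sum() for c in candidates])

    # Assumed similarity: softmax over negative Hamming distance to greedy,
    # so the greedy action (distance 0) receives the largest weight.
    hamming = np.array([sum(int(a != g) for a, g in zip(c, greedy))
                        for c in candidates], dtype=float)
    logits = -hamming / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()

    not_done = 0.0 if done else 1.0
    sim_target = reward + gamma * not_done * float(weights @ q_tot)
    max_target = reward + gamma * not_done * float(q_tot[0])
    return sim_target, max_target


if __name__ == "__main__":
    qs = [[1.0, 0.0], [2.0, 0.5]]  # two agents, two actions each
    sim, mx = similarity_weighted_target(qs, reward=0.0, gamma=1.0)
    print(f"max-based target: {mx:.3f}, similarity-weighted target: {sim:.3f}")
```

Because every candidate's joint value is at most the greedy one, the weighted average in this sketch can never exceed the max-based target, which is the smoothing effect the abstract describes. As `temperature` approaches zero the weights concentrate on the greedy action and the standard target is recovered.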