HKUST | State Key Laboratory of Nervous System Disorders | Feb 26, 2026 | arXiv:2602.22786

QSIM: Mitigating Overestimation in Multi-Agent Reinforcement Learning via Action Similarity Weighted Q-Learning

Yuanjun Li, Bin Zhang, Hao Chen, Zhouyang Jiang, Dapeng Li, Zhiwei Xu

AI Summary

The paper introduces QSIM, a novel framework to mitigate Q-value overestimation in value decomposition-based multi-agent reinforcement learning (MARL) by reconstructing the TD target using action similarity. QSIM computes a similarity-weighted expectation over a structured near-greedy joint action space, effectively smoothing the target with behaviorally related actions. Empirical results demonstrate that QSIM, when integrated with existing VD methods, achieves superior performance and stability by significantly reducing systematic value overestimation.

Key Contribution

By weighting Q-learning updates based on action similarity, QSIM tames overestimation in multi-agent RL, leading to more stable and effective learning.

Abstract

Value decomposition (VD) methods have achieved remarkable success in cooperative multi-agent reinforcement learning (MARL). However, their reliance on the max operator for temporal-difference (TD) target calculation leads to systematic Q-value overestimation. This issue is particularly severe in MARL due to the combinatorial explosion of the joint action space, which often results in unstable learning and suboptimal policies. To address this problem, we propose QSIM, a similarity weighted Q-learning framework that reconstructs the TD target using action similarity. Instead of using the greedy joint action directly, QSIM forms a similarity weighted expectation over a structured near-greedy joint action space. This formulation allows the target to integrate Q-values from diverse yet behaviorally related actions while assigning greater influence to those that are more similar to the greedy choice. By smoothing the target with structurally relevant alternatives, QSIM effectively mitigates overestimation and improves learning stability. Extensive experiments demonstrate that QSIM can be seamlessly integrated with various VD methods, consistently yielding superior performance and stability compared to the original algorithms. Furthermore, empirical analysis confirms that QSIM significantly mitigates the systematic value overestimation in MARL. Code is available at https://github.com/MaoMaoLYJ/pymarl-qsim.
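The target construction described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the similarity measure, how the near-greedy joint action space is built, or the mixing network, so everything concrete here is an assumption made for illustration: the near-greedy set is taken to be the greedy joint action plus all joint actions that change exactly one agent's action, similarity is the fraction of agents whose action matches the greedy one, weights come from a softmax with a hypothetical `temperature` parameter, and `q_tot_vdn` uses simple VDN-style additive mixing. None of the function names come from the authors' code.

```python
import numpy as np


def near_greedy_joint_actions(greedy, n_actions):
    """Assumed near-greedy set: the greedy joint action plus every joint
    action obtained by changing exactly one agent's action."""
    greedy = tuple(int(a) for a in greedy)
    candidates = [greedy]
    for agent, current in enumerate(greedy):
        for alt in range(n_actions):
            if alt != current:
                changed = list(greedy)
                changed[agent] = alt
                candidates.append(tuple(changed))
    return candidates


def similarity_weights(candidates, greedy, temperature=0.5):
    """Assumed similarity: fraction of agents whose action matches the greedy
    action, turned into normalized weights with a softmax."""
    greedy = np.asarray(greedy)
    sims = np.array([(np.asarray(c) == greedy).mean() for c in candidates])
    logits = sims / temperature
    logits -= logits.max()  # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def q_tot_vdn(agent_qs, joint_action):
    """Assumed mixer: VDN-style sum of each agent's chosen utility."""
    return float(sum(agent_qs[i, a] for i, a in enumerate(joint_action)))


def max_target(agent_qs, reward, gamma=0.99):
    """Standard TD target built from the greedy joint action only."""
    return reward + gamma * float(agent_qs.max(axis=1).sum())


def similarity_weighted_target(agent_qs, reward, gamma=0.99, temperature=0.5):
    """TD target as a similarity-weighted expectation over the near-greedy set."""
    n_actions = agent_qs.shape[1]
    greedy = agent_qs.argmax(axis=1)
    candidates = near_greedy_joint_actions(greedy, n_actions)
    weights = similarity_weights(candidates, greedy, temperature)
    values = np.array([q_tot_vdn(agent_qs, c) for c in candidates])
    return reward + gamma * float(weights @ values)


# Next-state utilities for 2 agents with 2 actions each (made-up numbers).
next_agent_qs = np.array([[1.0, 0.0],
                          [0.5, 2.0]])
print(max_target(next_agent_qs, reward=0.0, gamma=1.0))
print(similarity_weighted_target(next_agent_qs, reward=0.0, gamma=1.0))
```

Because the weights form a convex combination and the greedy joint action maximizes the additive `q_tot_vdn`, the weighted target in this sketch can never exceed the max-based target. Lowering `temperature` concentrates weight on the greedy action and recovers the standard target in the limit.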

RLHF & Preference Learning | Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 33

Year: 2026

Venue: N/A
