This paper introduces RQRE-OVI, an optimistic value iteration algorithm for computing Risk-Sensitive Quantal Response Equilibrium (RQRE) in multi-agent reinforcement learning with linear function approximation. RQRE offers a unique, smooth solution under bounded rationality and risk sensitivity, addressing the limitations of Nash equilibrium in general-sum Markov games. The authors provide a finite-sample regret analysis demonstrating a tradeoff between rationality (which tightens regret) and risk sensitivity (which enhances stability), and empirically show that RQRE-OVI is substantially more robust under cross-play than Nash-based methods.
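For intuition, a logit quantal response simply softmaxes payoffs with a rationality parameter, and risk sensitivity can be modeled by replacing expected payoffs with a risk-adjusted value. The minimal Python sketch below uses an entropic risk measure and illustrative function names as assumptions; the paper's precise RQRE definition, risk measure, and parameterization may differ.

```python
import numpy as np

def entropic_risk(payoff_samples, gamma):
    """Risk-adjusted value -(1/gamma) * log E[exp(-gamma * R)] of a random payoff R.

    gamma > 0 penalizes payoff variability (risk aversion); gamma -> 0 recovers the mean.
    """
    x = np.asarray(payoff_samples, dtype=float)
    if gamma == 0.0:
        return float(x.mean())
    z = -gamma * x
    m = z.max()  # log-sum-exp trick for numerical stability
    log_mean_exp = m + np.log(np.exp(z - m).mean())
    return float(-log_mean_exp / gamma)

def quantal_response(risk_adjusted_values, lam):
    """Logit quantal response: softmax of lam * values.

    lam -> infinity approaches a hard best response; lam -> 0 approaches uniform play.
    """
    z = lam * np.asarray(risk_adjusted_values, dtype=float)
    z -= z.max()  # numerical stability
    p = np.exp(z)
    return p / p.sum()

# Toy example: three actions with sampled payoffs; the high-variance second
# action is discounted once gamma > 0, and the resulting policy stays smooth.
rng = np.random.default_rng(0)
samples = [rng.normal(mu, sd, 10_000) for mu, sd in [(1.0, 0.1), (1.2, 2.0), (0.8, 0.5)]]
values = [entropic_risk(s, gamma=1.0) for s in samples]
print(quantal_response(values, lam=5.0))
```

Because the softmax map is smooth in its inputs, small errors in estimated payoffs shift the policy only slightly, which is the Lipschitz-continuity property the paper contrasts with Nash best responses.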
Ditch brittle Nash equilibria: a new algorithm finds more robust MARL policies by tuning risk sensitivity and rationality.
Provably efficient and robust equilibrium computation in general-sum Markov games remains a core challenge in multi-agent reinforcement learning. Nash equilibrium is computationally intractable in general and brittle due to equilibrium multiplicity and sensitivity to approximation error. We study Risk-Sensitive Quantal Response Equilibrium (RQRE), which yields a unique, smooth solution under bounded rationality and risk sensitivity. We propose \texttt{RQRE-OVI}, an optimistic value iteration algorithm for computing RQRE with linear function approximation in large or continuous state spaces. Through a finite-sample regret analysis, we establish convergence and explicitly characterize how sample complexity scales with the rationality and risk-sensitivity parameters. The regret bounds reveal a quantitative tradeoff: increasing rationality tightens regret, while risk sensitivity induces regularization that enhances stability and robustness. This exposes a Pareto frontier between expected performance and robustness, with Nash recovered in the limit of perfect rationality and risk neutrality. We further show that the RQRE policy map is Lipschitz continuous in estimated payoffs, unlike the Nash correspondence, and that RQRE admits a distributionally robust optimization interpretation. Empirically, we demonstrate that \texttt{RQRE-OVI} achieves competitive performance under self-play while producing substantially more robust behavior under cross-play than Nash-based approaches. These results suggest that \texttt{RQRE-OVI} offers a principled, scalable, and tunable path for equilibrium learning with improved robustness and generalization.
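To give a concrete picture of the optimistic value-iteration machinery, the sketch below shows, under illustrative assumptions, two ingredients such an algorithm typically combines: a ridge-regression Q estimate with an elliptical confidence bonus over linear features, and a smooth log-sum-exp backup induced by a logit quantal response. The function names, bonus form, and backup here are assumptions for illustration only; \texttt{RQRE-OVI}'s actual updates, multi-agent coupling, and risk-sensitive operator are specified in the paper.

```python
import numpy as np

def optimistic_q(phi, features_hist, targets_hist, beta, reg=1.0):
    """Optimistic Q estimate with linear features (LSVI-UCB-style sketch).

    Returns phi^T w_hat + beta * sqrt(phi^T Lambda^{-1} phi), where w_hat is a
    ridge-regression fit of past targets (reward plus next-state value) and the
    bonus grows with feature-space uncertainty, encouraging exploration.
    """
    Phi = np.asarray(features_hist, dtype=float)      # (n, d) past feature vectors
    y = np.asarray(targets_hist, dtype=float)         # (n,) past regression targets
    phi = np.asarray(phi, dtype=float)
    Lam = reg * np.eye(phi.shape[0]) + Phi.T @ Phi    # regularized Gram matrix
    w_hat = np.linalg.solve(Lam, Phi.T @ y)           # ridge solution
    bonus = beta * np.sqrt(phi @ np.linalg.solve(Lam, phi))
    return float(phi @ w_hat + bonus)

def smooth_value(q_values, lam):
    """Log-sum-exp backup induced by a logit quantal response with rationality lam.

    lam -> infinity recovers the hard max of Nash-style value iteration, while
    finite lam keeps the backup smooth in the estimated Q-values.
    """
    z = lam * np.asarray(q_values, dtype=float)
    m = z.max()
    return float((m + np.log(np.exp(z - m).sum())) / lam)
```

In a backup of this kind, the rationality parameter controls how closely the smooth value tracks the hard max, mirroring the rationality-regret tradeoff and the recovery of Nash in the perfectly rational, risk-neutral limit described above.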