Agent Q-Mix is introduced as a reinforcement learning framework that optimizes the selection and interconnection of LLM agents for complex problem-solving by framing topology selection as a cooperative MARL problem. It uses QMIX value factorization to learn decentralized communication decisions, combining a topology-aware GNN encoder, GRU memory, and per-agent Q-heads under a CTDE paradigm, balancing task accuracy with token cost. Experiments across coding, reasoning, and mathematics benchmarks demonstrate that Agent Q-Mix achieves higher accuracy, token efficiency, and robustness than existing methods, including 20.8% accuracy on Humanity's Last Exam using Gemini-3.1-Flash-Lite.
Forget hand-designed agent communication topologies: Agent Q-Mix learns decentralized communication strategies that boost accuracy and token efficiency in LLM multi-agent systems.
Large Language Models (LLMs) have shown remarkable performance across a wide range of tasks. However, solving complex problems often requires the coordination of multiple agents, raising a fundamental question: how to effectively select and interconnect these agents. In this paper, we propose Agent Q-Mix, a reinforcement learning framework that reformulates topology selection as a cooperative Multi-Agent Reinforcement Learning (MARL) problem. Our method learns decentralized communication decisions using QMIX value factorization, where each agent selects from a set of communication actions that jointly induce a round-wise communication graph. At its core, Agent Q-Mix combines a topology-aware GNN encoder, GRU memory, and per-agent Q-heads under a Centralized Training with Decentralized Execution (CTDE) paradigm. The framework optimizes a reward function that balances task accuracy against token cost. Across seven core benchmarks in coding, reasoning, and mathematics, Agent Q-Mix achieves the highest average accuracy among the compared methods while demonstrating superior token efficiency and robustness to agent failure. Notably, on the challenging Humanity's Last Exam (HLE) using Gemini-3.1-Flash-Lite as a backbone, Agent Q-Mix achieves 20.8% accuracy, outperforming Microsoft Agent Framework (19.2%) and LangGraph (19.2%), with AutoGen and Lobster by OpenClaw further behind. These results underscore the effectiveness of learned, decentralized topology optimization in pushing the boundaries of multi-agent reasoning.
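To make the value-factorization idea concrete, the following is a minimal NumPy sketch of QMIX-style monotonic mixing, the mechanism Agent Q-Mix builds on. All names, layer sizes, and the linear hypernetwork below are illustrative assumptions, not the paper's actual architecture; the point is the monotonicity constraint that makes decentralized greedy execution consistent with the centrally trained joint value.

```python
import numpy as np

rng = np.random.default_rng(0)

N_AGENTS = 3    # cooperating LLM agents (assumed size, for illustration)
STATE_DIM = 8   # global state, visible only during centralized training
HIDDEN = 4      # mixing-network hidden width

# Fixed hypernetwork parameters: they map the global state to mixing weights.
W1_GEN = rng.standard_normal((STATE_DIM, N_AGENTS * HIDDEN))
W2_GEN = rng.standard_normal((STATE_DIM, HIDDEN))

def qmix(per_agent_q, state):
    """Mix per-agent Q-values into a joint Q_tot.

    abs() keeps the state-conditioned mixing weights non-negative, so
    Q_tot is monotone in every per-agent Q-value. That is the QMIX
    property that lets each agent pick its communication action greedily
    from its own Q-head at execution time (the "decentralized execution"
    half of CTDE) while still optimizing the shared team reward.
    """
    w1 = np.abs(state @ W1_GEN).reshape(N_AGENTS, HIDDEN)
    w2 = np.abs(state @ W2_GEN).reshape(HIDDEN, 1)
    hidden = np.maximum(per_agent_q @ w1, 0.0)  # ReLU nonlinearity
    return float(hidden @ w2)

state = rng.standard_normal(STATE_DIM)
q_low = np.array([0.1, 0.2, 0.3])
q_high = np.array([0.5, 0.6, 0.7])  # every agent's Q-value raised

# Monotonicity check: raising any agent's Q never lowers Q_tot.
assert qmix(q_high, state) >= qmix(q_low, state)
```

In a full system the per-agent Q-values would come from the recurrent per-agent Q-heads described above, and the team reward being maximized would combine task accuracy with a token-cost penalty; this sketch isolates only the mixing step.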