MF-Diffuser tackles offline multi-agent RL by planning in the Wasserstein space of trajectory distributions, using a value-weighted chaotic entropy objective and hierarchical coarse-to-fine denoising. This approach leverages mean-field theory to represent the full population dynamics with a small subset of agents, mitigating the curse of dimensionality. Theoretical analysis provides suboptimality bounds and Nash equilibrium convergence guarantees, while experiments demonstrate superior performance, especially with suboptimal data and large agent populations ($N \geq 10^3$).
Scaling offline MARL to thousands of agents is now tractable: MF-Diffuser uses mean-field theory to plan in trajectory distribution space, sidestepping the curse of dimensionality.
Diffusion-based planning has achieved strong results in single-agent offline reinforcement learning, yet scaling to many-agent systems remains intractable due to the curse of dimensionality in the joint trajectory space. We introduce MF-Diffuser, a framework that lifts trajectory planning to the Wasserstein space of trajectory distributions, where the propagation of chaos ensures that a small representative subset of agents captures the full population dynamics. Our approach features a value-weighted chaotic entropy objective that reconciles generative fidelity with return maximization, and a hierarchical coarse-to-fine strategy that progressively grows the agent population during denoising. We establish end-to-end suboptimality bounds with four interpretable terms, revealing that the mean-field approximation error scales as $O(H^2/\sqrt{N})$ while offline distribution shift provably does not grow with population size $N$, and we prove that the generated policy is an approximate mean-field Nash equilibrium with explicit convergence guarantees. Experiments on three mean-field RL benchmarks (spanning stage games, sequential dynamics, and adversarial team competition) show that MF-Diffuser achieves the best return in the majority of settings, with the largest gains on suboptimal offline data and at extreme scales ($N \geq 10^3$).
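The $1/\sqrt{N}$ dependence in the mean-field term can be illustrated with a toy experiment. The sketch below is not the paper's method and does not reproduce its bound. It assumes a hypothetical population whose scalar agent states are i.i.d. standard normal draws, so the true population statistic is 0, and measures how well the empirical mean of a subset of agents recovers it. The function name `subset_mean_field_rmse` and all parameters are illustrative assumptions. The horizon factor $H^2$ and agent interactions are not modeled.

```python
import numpy as np


def subset_mean_field_rmse(subset_size, trials=400, seed=0):
    """RMS error of a population statistic estimated from `subset_size` agents.

    Assumption (toy setup, not from the paper): each agent's scalar state is
    an i.i.d. draw from N(0, 1), so the true population mean is exactly 0.
    """
    rng = np.random.default_rng(seed)
    # One row per independent trial, one column per sampled agent.
    states = rng.standard_normal((trials, subset_size))
    # Empirical mean over the sampled agents, computed separately per trial.
    estimates = states.mean(axis=1)
    # Root-mean-square deviation from the true value 0 across trials.
    return float(np.sqrt(np.mean(estimates ** 2)))


if __name__ == "__main__":
    for k in (16, 256, 4096):
        rmse = subset_mean_field_rmse(k)
        # If the error decays like 1/sqrt(k), the rescaled value stays near 1.
        print(f"K={k:5d}  rmse={rmse:.4f}  rmse*sqrt(K)={rmse * np.sqrt(k):.3f}")
```

In this i.i.d. setting the rescaled error `rmse * sqrt(K)` should stay roughly constant as the subset grows, which is the same rate that appears in the abstract's mean-field term. Propagation of chaos extends this kind of statement to weakly interacting agents, which the toy example does not cover.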