This paper investigates multi-agent cooperation using sequence model agents trained with decentralized reinforcement learning against a diverse set of co-players. It demonstrates that in-context learning enables agents to develop best-response strategies and adapt to co-player behavior within episodes, effectively acting as a fast timescale learning algorithm. The key finding is that this setup induces a cooperative mechanism driven by vulnerability to extortion, leading to mutual shaping of in-context learning dynamics and the emergence of cooperative behavior.
Sequence models can learn to cooperate in multi-agent settings simply by training against diverse partners, with no explicit meta-learning required.
Achieving cooperation among self-interested agents remains a fundamental challenge in multi-agent reinforcement learning. Recent work showed that mutual cooperation can be induced between "learning-aware" agents that account for and shape the learning dynamics of their co-players. However, existing approaches typically rely on hardcoded, often inconsistent, assumptions about co-player learning rules or enforce a strict separation between "naive learners" updating on fast timescales and "meta-learners" observing these updates. Here, we demonstrate that the in-context learning capabilities of sequence models allow for co-player learning awareness without requiring hardcoded assumptions or explicit timescale separation. We show that training sequence model agents against a diverse distribution of co-players naturally induces in-context best-response strategies, effectively functioning as learning algorithms on the fast intra-episode timescale. We find that the cooperative mechanism identified in prior work, where vulnerability to extortion drives mutual shaping, emerges naturally in this setting: in-context adaptation renders agents vulnerable to extortion, and the resulting mutual pressure to shape the opponent's in-context learning dynamics resolves into the learning of cooperative behavior. Our results suggest that standard decentralized reinforcement learning on sequence models combined with co-player diversity provides a scalable path to learning cooperative behaviors.
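As a toy illustration of why the best response depends on the co-player, the sketch below plays a repeated two-action game against two fixed co-players. It is not the paper's method or environment: the iterated prisoner's dilemma, the payoff values, the hand-written strategies, and every function name are assumptions chosen for illustration, and the exhaustive comparison of two candidate policies only stands in for what a trained sequence model would have to infer in context.

```python
# Row-player payoffs for a prisoner's dilemma (assumed, illustrative values).
PAYOFF = {("C", "C"): 3, ("C", "D"): 0, ("D", "C"): 5, ("D", "D"): 1}


def tit_for_tat(own_history, other_history):
    """Cooperate first, then copy the other player's previous action."""
    return other_history[-1] if other_history else "C"


def always_defect(own_history, other_history):
    return "D"


def always_cooperate(own_history, other_history):
    return "C"


def play(agent, co_player, rounds=20):
    """Return the agent's total payoff over one episode."""
    agent_hist, co_hist = [], []
    total = 0
    for _ in range(rounds):
        a = agent(agent_hist, co_hist)
        b = co_player(co_hist, agent_hist)
        total += PAYOFF[(a, b)]
        agent_hist.append(a)
        co_hist.append(b)
    return total


# Hypothetical candidate policies the agent chooses between.
CANDIDATES = {
    "always_cooperate": always_cooperate,
    "always_defect": always_defect,
}


def best_response(co_player, candidates=CANDIDATES, rounds=20):
    """Name of the candidate policy with the highest episode payoff."""
    return max(candidates, key=lambda name: play(candidates[name], co_player, rounds))


if __name__ == "__main__":
    for name, co in [("tit_for_tat", tit_for_tat), ("always_defect", always_defect)]:
        print(name, "->", best_response(co))
```

Under these assumed payoffs, cooperating is the better of the two candidates against the responsive co-player (tit-for-tat) and defecting is better against the unconditional defector. In this toy setting, a co-player whose behavior reacts to the agent changes what the agent's best response is, which is a rough intuition for the "shaping" the abstract refers to.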