This paper investigates the application of reinforcement learning (RL) to optimize the orchestration of LLM-based multi-agent systems, focusing on temporal interaction graphs called orchestration traces. The authors identify key technical axes for RL in this context, including reward design, credit assignment, and sub-decisions related to agent spawning, delegation, communication, aggregation, and stopping. Their analysis of academic methods and public industrial evidence reveals a gap in explicit RL training methods for the stopping decision and a broader scale gap between academic evaluation and industrial deployment.
Turns out, no method in the authors' curated pool explicitly uses RL to train LLM agents on when to *stop* in multi-agent systems, even though the stopping decision plays a critical role in efficiency and cost.
As large language model (LLM) agents evolve from isolated tool users into coordinated teams, reinforcement learning (RL) must optimize not only individual actions but also how work is spawned, delegated, communicated, aggregated, and stopped. This paper studies RL for LLM-based multi-agent systems through orchestration traces: temporal interaction graphs whose events include sub-agent spawning, delegation, communication, tool use, return, aggregation, and stopping decisions. Using this lens, we identify three technical axes. First, reward design spans eight families, including orchestration rewards for parallelism speedup, split correctness, and aggregation quality. Second, reward and credit signals attach to eight credit- or signal-bearing units from token to team; explicit counterfactual message-level credit remains especially sparse in our curated pool. Third, orchestration learning decomposes into five sub-decisions: when to spawn, whom to delegate to, how to communicate, how to aggregate, and when to stop. In our curated pool as of May 4, 2026, we found no explicit RL training method for the stopping decision. We connect academic methods to public industrial evidence from Kimi Agent Swarm, OpenAI Codex, and Anthropic Claude Code. The resulting scale gap is a gap between publicly reported deployment envelopes and open academic evaluation regimes, not independent verification of industrial training traces. We release the artifact at https://github.com/xxzcc/awesome-llm-mas-rl, including an 84-entry tagged paper pool, a 32-record exclusion log, scripted corpus statistics, and a minimal JSON schema for replayable orchestration traces.
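To make the notion of an orchestration trace concrete, here is a minimal sketch of one as a time-ordered event log that can be read as a directed interaction graph. The event kinds come from the abstract; everything else (the class names `TraceEvent` and `OrchestrationTrace`, the fields `step`, `agent`, and `target`, and the JSON layout) is a hypothetical illustration and is not the schema released in the authors' artifact.

```python
import json
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class EventKind(str, Enum):
    # Event kinds named in the abstract.
    SPAWN = "spawn"
    DELEGATE = "delegate"
    COMMUNICATE = "communicate"
    TOOL_USE = "tool_use"
    RETURN = "return"
    AGGREGATE = "aggregate"
    STOP = "stop"


@dataclass
class TraceEvent:
    # All field names are assumptions made for this sketch.
    step: int                     # position in temporal order
    kind: EventKind
    agent: str                    # id of the acting agent
    target: Optional[str] = None  # id of the receiving agent, if any


@dataclass
class OrchestrationTrace:
    events: List[TraceEvent] = field(default_factory=list)

    def add(self, kind: EventKind, agent: str,
            target: Optional[str] = None) -> TraceEvent:
        """Append an event; its step index is its position in the log."""
        event = TraceEvent(len(self.events), kind, agent, target)
        self.events.append(event)
        return event

    def edges(self) -> List[Tuple[int, str, str, str]]:
        """Agent-to-agent edges (step, source, target, kind) in time order."""
        return [(e.step, e.agent, e.target, e.kind.value)
                for e in self.events if e.target is not None]

    def to_json(self) -> str:
        """Serialize the log so the trace can be stored and replayed."""
        return json.dumps([
            {"step": e.step, "kind": e.kind.value,
             "agent": e.agent, "target": e.target}
            for e in self.events
        ])


# A toy episode: one orchestrator, one worker (agent ids are made up).
trace = OrchestrationTrace()
trace.add(EventKind.SPAWN, "orchestrator", "worker-1")
trace.add(EventKind.DELEGATE, "orchestrator", "worker-1")
trace.add(EventKind.TOOL_USE, "worker-1")
trace.add(EventKind.RETURN, "worker-1", "orchestrator")
trace.add(EventKind.AGGREGATE, "orchestrator")
trace.add(EventKind.STOP, "orchestrator")
```

In a representation like this, each of the five sub-decisions (spawn, delegate, communicate, aggregate, stop) appears as an identifiable event, which is what would let a reward or credit signal be attached to a specific decision rather than only to the final outcome.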