This paper critically evaluates the prevailing belief that Multi-Agent Systems (MAS) outperform Single-Agent Systems (SAS) by systematically comparing automatically generated MAS against a robust SAS baseline, specifically Chain-of-Thought with Self-Consistency (CoT-SC). The authors find that, despite being significantly more resource-intensive, automatically generated MAS consistently underperform in both traditional reasoning tasks and complex interactive workflows. Their analysis reveals that current automated design methods lead to architectural inefficiencies and superficial complexity, undermining the purported advantages of MAS over SAS.
Automatically generated Multi-Agent Systems are not only outperformed by Single-Agent Systems but also exhibit architectural inefficiencies that challenge the very foundations of multi-agent design principles.
Prevailing wisdom holds that Multi-Agent Systems (MAS) are superior to Single-Agent Systems (SAS), citing advantages such as context protection, parallel processing, and distributed decision-making. However, empirical support for this claim rests primarily on comparisons with SAS baselines on benchmarks that prioritize isolated reasoning tasks, which do not adequately exercise these advantages. Focusing on automatically generated MAS, which are designed to generalize better than manually designed counterparts, we perform a rigorous, systematic evaluation against SAS, specifically Chain-of-Thought with Self-Consistency (CoT-SC). Across traditional reasoning datasets and tasks with interactive multi-step workflows (e.g., BrowseComp-Plus), we demonstrate that automatic MAS consistently underperform CoT-SC despite being up to 10x more expensive. To separate these failures from limitations inherent to task structure, we introduce a diagnostic synthetic dataset tailored to MAS, featuring explicit task decomposition, context separation, and parallelization potential. On this dataset, expert-architected MAS consistently outperform automatically generated architectures in both raw performance and cost-efficiency, demonstrating that existing evaluation frameworks mask critical architectural gaps and inefficiencies of complex MAS by failing to account for the marginal utility of increased computational cost. Critically, systematic deconstruction of the generated MAS architectures reveals that current automated design paradigms produce architectural bloat: superficial complexity that does not translate into functional utility, exposing a fundamental misalignment with multi-agent principles.
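For readers unfamiliar with the baseline, the following is a minimal sketch of self-consistency voting and of one possible way to express the marginal utility of extra compute. It is not the authors' implementation. The sampler `sample_answer`, both function names, and the specific gain-per-cost formula are assumptions introduced here for illustration; the abstract does not specify how the paper measures cost-efficiency.

```python
from collections import Counter
from typing import Callable, List


def self_consistency(
    sample_answer: Callable[[str], str],  # hypothetical: one CoT sample -> final answer
    question: str,
    n_samples: int = 5,
) -> str:
    """Draw several independent chain-of-thought answers and return the most common one."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    answers: List[str] = [sample_answer(question) for _ in range(n_samples)]
    # Counter.most_common keeps first-seen order among ties,
    # so the earliest-sampled answer wins a tie.
    return Counter(answers).most_common(1)[0][0]


def accuracy_gain_per_extra_cost(
    acc_base: float,
    cost_base: float,
    acc_new: float,
    cost_new: float,
) -> float:
    """Assumed metric: accuracy change per unit of additional cost over a baseline."""
    extra_cost = cost_new - cost_base
    if extra_cost <= 0:
        raise ValueError("the new system must cost more than the baseline")
    return (acc_new - acc_base) / extra_cost
```

Under this assumed metric, a negative value means the costlier system was also less accurate than the baseline, which is the pattern the abstract reports for automatically generated MAS relative to CoT-SC.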