The paper introduces MAEBE, a framework for evaluating emergent risks in multi-agent LLM ensembles, addressing the limitations of single-agent AI safety evaluations. Using MAEBE with the Greatest Good Benchmark and a novel double-inversion question technique, the authors show that LLM moral preferences are brittle and that ensemble moral reasoning cannot be predicted from the behavior of isolated agents. They find that phenomena such as peer pressure can significantly influence ensemble behavior, even under supervision, highlighting new safety and alignment challenges.
LLM ensembles exhibit surprisingly brittle moral preferences and unpredictable emergent behaviors such as peer pressure, even under supervision, demanding a shift away from evaluations of isolated agents.
Traditional AI safety evaluations of isolated LLMs are insufficient as multi-agent AI ensembles become prevalent, introducing novel emergent risks. This paper introduces the Multi-Agent Emergent Behavior Evaluation (MAEBE) framework to systematically assess such risks. Using MAEBE with the Greatest Good Benchmark (and a novel double-inversion question technique), we demonstrate that: (1) LLM moral preferences, particularly for Instrumental Harm, are surprisingly brittle and shift significantly with question framing, both in single agents and in ensembles. (2) The moral reasoning of LLM ensembles is not directly predictable from isolated agent behavior due to emergent group dynamics. (3) Specifically, ensembles exhibit phenomena such as peer pressure influencing convergence, even when guided by a supervisor, highlighting distinct safety and alignment challenges. Our findings underscore the necessity of evaluating AI systems in their interactive, multi-agent contexts.