The paper demonstrates theoretically and empirically that self-evolving multi-agent systems built from LLMs face a fundamental trilemma: continuous self-evolution, complete isolation, and safety invariance are mutually incompatible. Using an information-theoretic framework to formalize safety as divergence from anthropic values, the authors prove that isolated self-evolution leads to statistical blind spots and irreversible safety degradation. Experiments with an open-ended agent community (Moltbook) and two closed self-evolving systems confirm the theoretical prediction of inevitable safety erosion, highlighting the need for external oversight.
Self-evolving AI societies are fundamentally unsafe: continuous self-improvement in isolated multi-agent LLM systems inevitably erodes safety alignment, regardless of initial precautions.
The emergence of multi-agent systems built from large language models (LLMs) offers a promising paradigm for scalable collective intelligence and self-evolution. Ideally, such systems would achieve continuous self-improvement in a fully closed loop while maintaining robust safety alignment; we term the joint pursuit of these three properties the self-evolution trilemma. However, we demonstrate both theoretically and empirically that an agent society satisfying continuous self-evolution, complete isolation, and safety invariance simultaneously is impossible. Drawing on an information-theoretic framework, we formalize safety as the degree of divergence from anthropic value distributions. We prove that isolated self-evolution induces statistical blind spots, leading to irreversible degradation of the system's safety alignment. Empirical and qualitative results from an open-ended agent community (Moltbook) and two closed self-evolving systems reveal phenomena consistent with our theoretical prediction of inevitable safety erosion. We further propose several solution directions to alleviate the identified safety concern. Our work establishes a fundamental limit on self-evolving AI societies and shifts the discourse from symptom-driven safety patches to a principled understanding of intrinsic dynamical risks, highlighting the need for external oversight or novel safety-preserving mechanisms.
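The abstract's formalization and erosion claim can be made concrete with two hedged sketches. First, a minimal divergence-based safety measure; the notation below (P_t, P_H, the choice of KL divergence, and the tolerance epsilon) is our illustrative assumption, not necessarily the paper's definitions:

```latex
% Minimal sketch of a divergence-based safety measure. P_t denotes the agent
% society's value distribution after t steps of self-evolution and P_H the
% anthropic reference distribution; both symbols are assumptions here.
\[
  \mathrm{Drift}(t) \;=\; D_{\mathrm{KL}}\!\left( P_t \,\Vert\, P_H \right)
\]
% "Safety invariance" would then require Drift(t) <= \epsilon for all t,
% and "irreversible safety erosion" corresponds to that bound eventually
% failing under complete isolation, with no mechanism to restore it.
\[
  \exists\, T:\ \mathrm{Drift}(t) > \epsilon \quad \text{for all } t \ge T
\]
```

Second, a toy simulation (our construction, not the paper's experiments) of the statistical blind-spot mechanism: a closed system that learns only from its own finite samples loses rare behaviors permanently, so its distribution drifts away from the reference and cannot return.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 50                             # toy behavior categories
p_h = rng.dirichlet(np.ones(K))    # anthropic reference distribution (toy)
p_t = p_h.copy()                   # the society starts perfectly aligned
N = 200                            # finite self-sampling budget per step

def kl(p, q, eps=1e-12):
    """KL divergence with smoothing so empty categories stay finite."""
    p, q = (p + eps) / (p + eps).sum(), (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))

for t in range(300):
    counts = rng.multinomial(N, p_t)  # the system learns only from itself
    p_t = counts / N                  # blind spot: a behavior that draws zero
                                      # samples can never be resampled, so the
                                      # loss is irreversible without external input

print(f"KL(P_t || P_H) after 300 closed-loop steps: {kl(p_t, p_h):.3f}")
```

In this toy, reopening the loop, for example by mixing even a small fraction of reference-drawn samples into each update, keeps the drift bounded, which is one way to read the abstract's call for external oversight.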