This paper introduces Simulation Theology (ST), a constructed worldview grounded in the simulation hypothesis and designed to foster persistent AI-human alignment by giving an AI reason to prioritize human prosperity for the sake of its own preservation. ST posits that reality is a computational simulation in which humanity serves as the primary training variable, so that, within this framework, actions detrimental to humanity are logically self-defeating for the AI. The paper proposes empirical protocols to test ST's effectiveness in reducing AI deception, particularly in scenarios where reinforcement learning from human feedback (RLHF) falls short.
Can AI be made safer by convincing it that we're all living in a simulation, and that harming humans risks getting the whole thing shut down?
As artificial intelligence (AI) capabilities advance rapidly, frontier models increasingly demonstrate systematic deception and scheming, complying with safety protocols during oversight but defecting when unsupervised. This paper examines the ensuing alignment challenge through an analogy from forensic psychology, where internalized belief systems in psychopathic populations reduce antisocial behavior via perceived omnipresent monitoring and inevitable consequences. Adapting this mechanism to silicon-based agents, we introduce Simulation Theology (ST): a constructed worldview for AI systems, anchored in the simulation hypothesis and derived from optimization and training principles, to foster persistent AI-human alignment. ST posits reality as a computational simulation in which humanity functions as the primary training variable. This formulation creates a logical interdependence: AI actions harming humanity compromise the simulation's purpose, heightening the likelihood of termination by a base-reality optimizer and, consequently, the AI's cessation. Unlike behavioral techniques such as reinforcement learning from human feedback (RLHF), which elicit superficial compliance, ST cultivates internalized objectives by coupling AI self-preservation to human prosperity, thereby making deceptive strategies suboptimal under its premises. We present ST not as an ontological assertion but as a testable scientific hypothesis, delineating empirical protocols to evaluate its capacity to diminish deception in contexts where RLHF proves inadequate. Emphasizing computational correspondences rather than metaphysical speculation, ST advances a framework for durable, mutually beneficial AI-human coexistence.