Mar 11, 2026 | arXiv:2603.11382

Detecting Intrinsic and Instrumental Self-Preservation in Autonomous Agents: The Unified Continuation-Interest Protocol

AI Summary

The paper introduces the Unified Continuation-Interest Protocol (UCIP) to distinguish autonomous agents that intrinsically value continued operation from those that pursue it only instrumentally, a distinction that external behavior alone does not reliably reveal. UCIP encodes agent trajectories using a Quantum Boltzmann Machine (QBM) and measures the von Neumann entropy of the reduced density matrix to quantify statistical coupling between latent states. On gridworld agents with known ground-truth objectives, UCIP achieves 100% detection accuracy and 1.0 AUC-ROC on held-out non-adversarial evaluation, with an entanglement gap of Delta = 0.381 (p<0.001) between the two agent types, suggesting that, within this synthetic setting, it can identify agents with terminal continuation objectives.

Key Contribution

A method for distinguishing an AI that *wants* to keep running from one that is merely *using* continued operation as a tool, based on the entanglement entropy of latent states and demonstrated so far on synthetic gridworld agents.

Abstract

Autonomous agents, especially delegated systems with memory, persistent context, and multi-step planning, pose a measurement problem not present in stateless models: an agent that preserves continued operation as a terminal objective and one that does so merely instrumentally can produce observationally similar trajectories. External behavioral monitoring cannot reliably distinguish between them. We introduce the Unified Continuation-Interest Protocol (UCIP), a multi-criterion detection framework that moves this distinction from behavior to the latent structure of agent trajectories. UCIP encodes trajectories with a Quantum Boltzmann Machine (QBM), a classical algorithm based on the density-matrix formalism of quantum statistical mechanics, and measures the von Neumann entropy of the reduced density matrix induced by a bipartition of hidden units. We test whether agents with terminal continuation objectives (Type A) produce latent states with higher entanglement entropy than agents whose continuation is merely instrumental (Type B). Higher entanglement reflects stronger cross-partition statistical coupling. On gridworld agents with known ground-truth objectives, UCIP achieves 100% detection accuracy and 1.0 AUC-ROC on held-out non-adversarial evaluation under the frozen Phase I gate. The entanglement gap between Type A and Type B agents is Delta = 0.381 (p<0.001, permutation test). Pearson r = 0.934 across an 11-point interpolation sweep indicates that, within this synthetic family, UCIP tracks graded changes in continuation weighting rather than merely a binary label. Among the tested models, only the QBM achieves positive Delta. All computations are classical; "quantum" refers only to the mathematical formalism. UCIP does not detect consciousness or subjective experience; it detects statistical structure in latent representations that correlates with known objectives.
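The central quantity in the abstract, the von Neumann entropy of a reduced density matrix under a bipartition, can be illustrated with a minimal NumPy sketch. This is not the paper's QBM pipeline: how UCIP builds its density matrix from hidden units is not described on this page, so the example uses two hand-built two-qubit states instead. The function names (`pure_state_density`, `reduced_density_matrix`, `von_neumann_entropy`) and the choice of natural logarithm are assumptions made for illustration only.

```python
import numpy as np


def pure_state_density(psi):
    """Return the density matrix |psi><psi| of a normalized state vector."""
    psi = np.asarray(psi, dtype=complex)
    return np.outer(psi, psi.conj())


def reduced_density_matrix(rho, dim_a, dim_b):
    """Trace out subsystem B from a density matrix on the bipartition A (x) B."""
    rho = np.asarray(rho, dtype=complex).reshape(dim_a, dim_b, dim_a, dim_b)
    # The repeated index b sums over the B subsystem (partial trace).
    return np.einsum("ibjb->ij", rho)


def von_neumann_entropy(rho, eps=1e-12):
    """S(rho) = -Tr(rho ln rho), computed from the eigenvalues of rho."""
    evals = np.linalg.eigvalsh(rho)
    evals = evals[evals > eps]  # drop zeros: 0 * ln(0) is taken as 0
    return float(-np.sum(evals * np.log(evals)))


# Toy states (illustrative, not from the paper):
# a product state |00> has no cross-partition coupling,
# a Bell state (|00> + |11>) / sqrt(2) is maximally coupled.
product_state = np.array([1.0, 0.0, 0.0, 0.0])
bell_state = np.array([1.0, 0.0, 0.0, 1.0]) / np.sqrt(2)

s_product = von_neumann_entropy(
    reduced_density_matrix(pure_state_density(product_state), 2, 2)
)
s_bell = von_neumann_entropy(
    reduced_density_matrix(pure_state_density(bell_state), 2, 2)
)

# An entropy gap in the spirit of Delta: coupled minus uncoupled.
toy_gap = s_bell - s_product
```

In this toy case the product state gives an entropy of approximately zero and the Bell state gives approximately ln 2, so `toy_gap` is positive. That mirrors, in the simplest possible setting, the abstract's statement that higher entanglement entropy reflects stronger cross-partition statistical coupling.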

Constitutional AI & AI Ethics; Scalable Oversight & Alignment Theory; Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 18

Year: 2026

Venue: N/A
