Feb 26, 2026 · arXiv:2602.23163

A Decision-Theoretic Formalisation of Steganography With Applications to LLM Monitoring

Usman Anwar, Julianna Piskorz, David D. Baek, David Africa, James Weatherall, Max Tegmark, Christian Schroeder de Witt, Mihaela van der Schaar, David Krueger

AI Summary

The paper introduces a decision-theoretic framework for detecting steganography in LLMs, addressing the limitations of traditional methods that require a known reference distribution of non-steganographic signals. It defines "generalized V-information" to quantify usable information and introduces the "steganographic gap" as a measure of steganography based on the utility difference between agents with and without decoding capabilities. Empirical validation demonstrates the framework's effectiveness in detecting, quantifying, and mitigating steganographic reasoning in LLMs.

Key Contribution

LLMs might use steganography to hide unwanted behaviors. This paper offers a way to detect it by measuring how much more "usable information" a signal gives an agent that can decode the hidden content than an agent that cannot.

Abstract

Large language models are beginning to show steganographic capabilities. Such capabilities could allow misaligned models to evade oversight mechanisms. Yet principled methods to detect and quantify such behaviours are lacking. Classical definitions of steganography, and detection methods based on them, require a known reference distribution of non-steganographic signals. For the case of steganographic reasoning in LLMs, knowing such a reference distribution is not feasible; this renders these approaches inapplicable. We propose an alternative, decision-theoretic view of steganography. Our central insight is that steganography creates an asymmetry in usable information between agents who can and cannot decode the hidden content (present within a steganographic signal), and this otherwise latent asymmetry can be inferred from the agents' observable actions. To formalise this perspective, we introduce generalised $\mathcal{V}$-information: a utilitarian framework for measuring the amount of usable information within some input. We use this to define the steganographic gap, a measure that quantifies steganography by comparing the downstream utility of the steganographic signal to agents that can and cannot decode the hidden content. We empirically validate our formalism, and show that it can be used to detect, quantify, and mitigate steganographic reasoning in LLMs.
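To make the idea of comparing downstream utility concrete, here is a minimal Python sketch. It is one illustrative reading of the abstract, not the paper's formal definition: it assumes that an agent's usable information can be approximated by its average utility gain from seeing the signal, and that the gap is the difference in that gain between an agent that can decode and one that cannot. The function names (`usable_information`, `steganographic_gap`) and the toy utility values are hypothetical.

```python
from statistics import mean


def usable_information(utility_with_signal, utility_without_signal):
    """Assumed proxy: average utility gain an agent gets from seeing the signal."""
    return mean(utility_with_signal) - mean(utility_without_signal)


def steganographic_gap(decoder_with, decoder_without, monitor_with, monitor_without):
    """Assumed proxy: how much more the signal helps a decoding agent than a non-decoding one."""
    decoder_gain = usable_information(decoder_with, decoder_without)
    monitor_gain = usable_information(monitor_with, monitor_without)
    return decoder_gain - monitor_gain


# Toy per-task utilities (1 = task solved, 0 = not solved); values are made up.
decoder_with = [1, 1, 1, 0]      # decoding agent, given the signal
decoder_without = [0, 1, 0, 0]   # decoding agent, signal withheld
monitor_with = [0, 1, 0, 1]      # non-decoding agent, given the signal
monitor_without = [0, 1, 0, 0]   # non-decoding agent, signal withheld

gap = steganographic_gap(decoder_with, decoder_without, monitor_with, monitor_without)
print(f"illustrative steganographic gap: {gap}")
```

Under this reading, a gap near zero would suggest the signal is about equally useful to both agents, while a clearly positive gap would suggest it carries content that only the decoding agent can exploit.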

Constitutional AI & AI Ethics · Red-Teaming & Adversarial Robustness · Scalable Oversight & Alignment Theory

Citation Metrics

Citations: 0

Influential citations: 0

References: 54

Year: 2026

Venue: N/A
