UNC · Apr 13, 2026 · arXiv:2604.11666

Playing Along: Learning a Double-Agent Defender for Belief Steering via Theory of Mind

Hanqi Xiao, Vaidehi Patil, Zaid Khan, Hyunji Lee, Elias Stengel-Eskin, Mohit Bansal

AI Summary

The paper introduces ToM for Steering Beliefs (ToM-SB), a novel challenge in which a defender LLM acts as a double agent to mislead an attacker about sensitive information. The authors find that state-of-the-art LLMs struggle with this task, even with theory-of-mind (ToM) prompting. They then use reinforcement learning to train AI Double Agents and observe a bidirectional relationship between ToM and attacker-fooling: rewarding either one alone improves the other. The best performance comes from combining ToM and fooling rewards.

Key Contribution

Even with ToM prompting, today's frontier LLMs often fail to fool attackers in hard privacy scenarios where the attacker has partial prior knowledge, but RL-trained "double agents" learn to mislead attackers effectively by modeling their beliefs.

Abstract

As large language models (LLMs) become the engine behind conversational systems, their ability to reason about the intentions and states of their dialogue partners (i.e., form and use a theory-of-mind, or ToM) becomes increasingly critical for safe interaction with potentially adversarial partners. We propose a novel privacy-themed ToM challenge, ToM for Steering Beliefs (ToM-SB), in which a defender must act as a Double Agent to steer the beliefs of an attacker with partial prior knowledge within a shared universe. To succeed on ToM-SB, the defender must engage with and form a ToM of the attacker, with a goal of fooling the attacker into believing they have succeeded in extracting sensitive information. We find that strong frontier models like Gemini3-Pro and GPT-5.4 struggle on ToM-SB, often failing to fool attackers in hard scenarios with partial attacker prior knowledge, even when prompted to reason about the attacker's beliefs (ToM prompting). To close this gap, we train models on ToM-SB to act as AI Double Agents using reinforcement learning, testing both fooling and ToM rewards. Notably, we find a bidirectionally emergent relationship between ToM and attacker-fooling: rewarding fooling success alone improves ToM, and rewarding ToM alone improves fooling. Across four attackers with different strengths, six defender methods, and both in-distribution and out-of-distribution (OOD) evaluation, we find that gains in ToM and attacker-fooling are well-correlated, highlighting belief modeling as a key driver of success on ToM-SB. AI Double Agents that combine both ToM and fooling rewards yield the strongest fooling and ToM performance, outperforming Gemini3-Pro and GPT-5.4 with ToM prompting on hard scenarios. We also show that ToM-SB and AI Double Agents can be extended to stronger attackers, demonstrating generalization to OOD settings and the upgradability of our task.
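To make the idea of combining ToM and fooling rewards concrete, here is a minimal sketch of how two such signals could be merged into a single scalar reward for one episode. It is an illustration only, not the authors' implementation: the abstract does not specify the reward definitions, so the `EpisodeOutcome` fields, the weighted-sum form, and the default weights `w_fool` and `w_tom` are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class EpisodeOutcome:
    """Hypothetical summary of one defender-vs-attacker episode."""
    # Assumption: True if the attacker ended the dialogue believing the
    # misleading information the defender steered it toward.
    attacker_fooled: bool
    # Assumption: fraction in [0, 1] of the defender's predictions about
    # the attacker's beliefs that were judged correct.
    tom_accuracy: float


def combined_reward(outcome: EpisodeOutcome,
                    w_fool: float = 1.0,
                    w_tom: float = 1.0) -> float:
    """Weighted sum of a fooling reward and a ToM reward.

    Setting w_tom=0 gives a fooling-only reward and w_fool=0 gives a
    ToM-only reward, mirroring the three reward settings the abstract
    mentions. The weighted-sum form itself is an assumption.
    """
    if not 0.0 <= outcome.tom_accuracy <= 1.0:
        raise ValueError("tom_accuracy must lie in [0, 1]")
    fooling = 1.0 if outcome.attacker_fooled else 0.0
    return w_fool * fooling + w_tom * outcome.tom_accuracy


if __name__ == "__main__":
    episode = EpisodeOutcome(attacker_fooled=True, tom_accuracy=0.5)
    print(combined_reward(episode))               # both rewards
    print(combined_reward(episode, w_tom=0.0))    # fooling reward only
    print(combined_reward(episode, w_fool=0.0))   # ToM reward only
```

A weighted sum is only one plausible design; the paper may define, scale, or combine these signals differently.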

Constitutional AI & AI Ethics · Reasoning & Chain-of-Thought · Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 51

Year: 2026

Venue: N/A
