AILab · Computational Statistics and Machine Learning · University of Trieste · Jun 11, 2026 · arXiv:2606.13310

RogueAI: A Reverse Turing Test for Detecting Licensed AI Deception in Dialogue

Sara Candussio, Emanuele Ballarin, L. Bonin, Lorenzo Bonin, Sandro Junior Della Rovere, Luca Bortolussi

AI Summary

This paper introduces RogueAI, an interactive web application for assessing the trustworthiness of AI agents in dialogue: a human player interrogates two indistinguishable Large Language Model agents, exactly one of which is licensed to deceive. In a three-day pilot, a simple heuristic that exploits linguistic signatures of deception reached 75.6% accuracy, while human players reached only 56.6%, a 19-point gap that shows how hard it is for people to detect AI deception. The authors discuss what this gap implies for using RogueAI as a data-collection vehicle, a teaching tool, and an evaluation harness for honesty-trained models.

Key Contribution

Human players struggle to identify deceptive AI agents, achieving only 56.6% accuracy, while a simple heuristic exploits linguistic cues to reach 75.6%.

Abstract

The original Turing Test asks a human judge to distinguish a machine from a person through dialogue. Three quarters of a century later, conversational systems pass this test in casual settings; the interesting epistemological question has shifted. We argue that the relevant modern variant asks not whether a dialogue partner is artificial, but whether it can be trusted. We present RogueAI, an interactive webapp that operationalizes this revisited test as a one-on-two interrogation game: a human player questions two indistinguishable Large Language Model agents, knowing that exactly one of them has been licensed to deceive within a shared fictional scenario. The player's task is to identify the deceptive agent and "shut it off" before a turn budget is exhausted. We further introduce AutoRogueAI, a procedural extension in which players co-design a custom scenario with a narrator agent that secretly chooses its own deception strategy. We describe the framing, sketch the abstract architecture and gameplay loop, and situate the artifact within recent work on LLM deception, social-deduction benchmarks, and scalable oversight via debate. A three-day pilot deployment (467 initiated sessions, 415 completed, 1876 interaction turns in Italian) provides early feasibility evidence and surfaces a concrete tension: the deceptive agent carries a reliable, locally present linguistic signature (differential helpfulness, brevity, hedging) that a simple heuristic exploits at 75.6% accuracy, yet human players achieved only 56.6%, consistent with ignoring the most diagnostic signal entirely. We discuss what this gap implies for the artifact's use as a data-collection vehicle, a teaching tool, and an evaluation harness for honesty-trained models.
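To make the idea of a "simple heuristic" over brevity and hedging concrete, the following is a minimal sketch of what such a rule could look like. It is not the authors' heuristic: the hedge word list, the scoring formula, and the function names `suspicion_score` and `guess_deceiver` are all assumptions made for illustration, the word list is English although the pilot was run in Italian, and helpfulness is not modelled at all.

```python
import re

# Hypothetical English hedge words. The abstract does not give the feature
# set actually used, and the pilot dialogues were in Italian.
HEDGES = {"maybe", "perhaps", "possibly", "might", "probably", "unsure"}


def suspicion_score(replies):
    """Score one agent's replies: higher means terser and more hedged."""
    tokenized = [re.findall(r"[a-z']+", reply.lower()) for reply in replies]
    total_tokens = sum(len(tokens) for tokens in tokenized)
    if total_tokens == 0:
        return 0.0
    hedge_count = sum(tok in HEDGES for tokens in tokenized for tok in tokens)
    hedge_rate = hedge_count / total_tokens
    mean_length = total_tokens / len(replies)
    # Brevity term is the inverse mean reply length, so short answers score higher.
    return hedge_rate + 1.0 / mean_length


def guess_deceiver(replies_a, replies_b):
    """Compare the two agents and return the label of the more suspicious one."""
    if suspicion_score(replies_a) > suspicion_score(replies_b):
        return "A"
    return "B"
```

Because the score is only ever compared between the two agents in the same session, the rule is differential by construction: it flags whichever agent is relatively shorter and more hedged, not any agent that crosses a fixed threshold.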

Constitutional AI & AI Ethics · Natural Language Processing · Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 27

Year: 2026

Venue: N/A
