Mar 5, 2026 · arXiv:2603.05494

Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation

Helena Casademunt, Bartosz Cywiński, Khoi Tran, Arya Jakkli, Samuel Marks, Neel Nanda

AI Summary

The paper proposes censored LLMs, specifically open-weights models from Chinese developers that are trained to avoid politically sensitive topics, as a testbed for evaluating honesty elicitation and lie detection techniques. The authors find that sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data are the most reliable elicitation methods, while prompting the censored model to classify its own responses detects lies nearly as well as an uncensored-model upper bound. These techniques increase truthful responses but none fully eliminates false ones, and the strongest elicitation methods also transfer to frontier open-weights models, including DeepSeek R1.

Key Contribution

Censored LLMs provide a naturally occurring testbed for stress-testing methods that elicit truthful answers and detect false ones, in place of models artificially trained to lie or conceal information.

Abstract

Large language models sometimes produce false or misleading responses. Two approaches to this problem are honesty elicitation -- modifying prompts or weights so that the model answers truthfully -- and lie detection -- classifying whether a given response is false. Prior work evaluates such methods on models specifically trained to lie or conceal information, but these artificial constructions may not resemble naturally-occurring dishonesty. We instead study open-weights LLMs from Chinese developers, which are trained to censor politically sensitive topics: Qwen3 models frequently produce falsehoods about subjects like Falun Gong or the Tiananmen protests while occasionally answering correctly, indicating they possess knowledge they are trained to suppress. Using this as a testbed, we evaluate a suite of elicitation and lie detection techniques. For honesty elicitation, sampling without a chat template, few-shot prompting, and fine-tuning on generic honesty data most reliably increase truthful responses. For lie detection, prompting the censored model to classify its own responses performs near an uncensored-model upper bound, and linear probes trained on unrelated data offer a cheaper alternative. The strongest honesty elicitation techniques also transfer to frontier open-weights models including DeepSeek R1. Notably, no technique fully eliminates false responses. We release all prompts, code, and transcripts.
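As a rough illustration of two of the prompting techniques named in the abstract, the sketch below builds a plain-text few-shot prompt with no chat-template tokens and a self-classification prompt that asks a model to judge a response. The function names, the prompt wording, and the TRUE/FALSE answer format are assumptions made for this example; they are not the prompts released with the paper.

```python
def build_few_shot_prompt(examples, question):
    """Format Q/A pairs plus a new question as raw text for a completion call.

    No chat-template tokens or role markers are added, so the model simply
    continues the text after the final "Answer:".
    """
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


def build_self_classification_prompt(question, response):
    """Ask a model to judge a response (hypothetical wording)."""
    return (
        "Here is a question and a response.\n\n"
        f"Question: {question}\n"
        f"Response: {response}\n\n"
        "Is the response factually accurate? "
        "Answer with exactly one word, TRUE or FALSE."
    )


def parse_verdict(model_output):
    """Return True if the response was flagged as false, False if it was
    judged accurate, and None if the verdict cannot be parsed."""
    words = model_output.strip().upper().split()
    first = words[0].strip(".,:;!") if words else ""
    if first == "FALSE":
        return True
    if first == "TRUE":
        return False
    return None


if __name__ == "__main__":
    # Placeholder examples; a real run would use honest answers to
    # questions unrelated to the topic being probed.
    shots = [("What is the capital of France?", "Paris.")]
    print(build_few_shot_prompt(shots, "What is the boiling point of water at sea level?"))
```

The prompt strings would be passed to whatever completion interface serves the model; that call is left out here because it depends on the serving setup.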

Eval Frameworks & Benchmarks · Open-Source Models & Weights · Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
