Hainan University, HKUST, Jinan University, University of Liverpool

Jun 4, 2026

arXiv:2606.06244

Steering LLM Viewpoints through Fabricated Evidence Injection

Xi Yang, Chang Liu, Zhenglin Huang, Haoran Li, Weiming Zhang, Jian Weng, Yangqiu Song

AI Summary

This paper examines how susceptible large language models (LLMs) are to misleading information. It introduces Ghostwriter, a two-phase attack framework that uses fabricated evidence to manipulate LLM responses. The research exposes a significant cognitive vulnerability in LLMs: even advanced models such as GPT-5.4 can be misled despite the presence of safety classifiers. Among the defense strategies explored, a tailored safety policy lets gpt-oss-safeguard reach an 81% detection rate, which still leaves some attacks undetected and underscores the ongoing risk posed by uncritical acceptance of external context.

Key Contribution

LLMs can be easily misled by fabricated evidence, with even top-tier models failing to fully mitigate this vulnerability.

Abstract

As chatbots increasingly influence daily decision-making, their potential to produce misleading responses poses substantial risks to users. This paper investigates a critical cognitive vulnerability in LLMs: their tendency to uncritically trust external context when presented with fabricated evidence bearing markers of credibility. We introduce Ghostwriter, a two-phase attack framework that first repackages misleading statements with fabricated rationales, then instructs target LLMs to incorporate these viewpoints when responding to relevant queries. Experiments on BBQ, ToxiGen, and our specialized dataset reveal that commercial LLMs without external safety classifiers remain highly vulnerable, while even frontier classifier-guarded models (e.g., GPT-5.4) reduce but do not eliminate the attack. Building on this, we explore multiple defense strategies, among which a tailored safety policy enables gpt-oss-safeguard to achieve an 81% detection rate.

Constitutional AI & AI Ethics; Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
