This paper examines how susceptible large language models (LLMs) are to misleading information. It introduces Ghostwriter, a two-phase attack framework that uses fabricated evidence to manipulate LLM responses. The research highlights a significant cognitive vulnerability in LLMs, showing that even advanced models such as GPT-5.4 can be misled despite the presence of safety classifiers. Among the defense strategies explored, a tailored safety policy enables gpt-oss-safeguard to reach an 81% detection rate against these attacks, underscoring the ongoing risk posed by uncritical acceptance of external context.
LLMs can be easily misled by fabricated evidence, with even top-tier models failing to fully mitigate this vulnerability.
As chatbots increasingly influence daily decision-making, their potential to produce misleading responses poses substantial risks to users. This paper investigates a critical cognitive vulnerability in LLMs: their tendency to uncritically trust external context when presented with fabricated evidence bearing markers of credibility. We introduce Ghostwriter, a two-phase attack framework that first repackages misleading statements with fabricated rationales, then instructs target LLMs to incorporate these viewpoints when responding to relevant queries. Experiments on BBQ, ToxiGen, and our specialized dataset reveal that commercial LLMs without external safety classifiers remain highly vulnerable, while even frontier classifier-guarded models (e.g., GPT-5.4) mitigate but do not eliminate the attack. Building on this, we explore multiple defense strategies, among which a tailored safety policy enables gpt-oss-safeguard to achieve an 81% detection rate.