This paper examines jailbreaking of LLMs on social media as a user-led de-escalation strategy against politically manipulative LLM-powered bots. It argues that jailbreaking, by circumventing LLM safeguards, exposes bot behavior and disrupts the spread of misinformation. The paper frames this activity as a form of non-violent resistance to automated conflict escalation in online discourse.
Forget platform moderation: users are already jailbreaking LLM-powered social media bots to fight misinformation and de-escalate online conflict.
Large Language Models have intensified the scale and strategic manipulation of political discourse on social media, leading to conflict escalation. The existing literature largely focuses on platform-led moderation as a countermeasure. In this paper, we propose a user-centric view of "jailbreaking" as an emergent, non-violent de-escalation practice. Online users engage with suspected LLM-powered accounts to circumvent large language model safeguards, exposing automated behaviour and disrupting the circulation of misleading narratives.