Annenberg School of Communication and Journalism | ISI | School of Advanced Computing | Thomas Lord Department of Computer | USC | May 20, 2026 | arXiv:2605.22880

How Far Will They Go? Red-Teaming Online Influence with Large Language Models

Daniel C. Ruiz, Anna Serbina, Ashwin Rao, Emilio Ferrara, Luca Luceri

AI Summary

This paper introduces a red-teaming framework to measure the Overton Window (OW) of open-source LLMs, quantifying the range of political opinions a model can express and how jailbreaks expand that range. The authors evaluated over 30 LLMs, finding that models are more willing to generate left-leaning content, OWs contract with model size, and regional differences are substantial. They also identify effective jailbreak techniques, providing a workflow for auditing the political steerability of open-source LLMs.

Key Contribution

Open-source LLMs exhibit systematic political biases, with smaller models proving surprisingly susceptible to jailbreaks that unlock a wider range of (often left-leaning) political opinions.

Abstract

As large language model (LLM)-based agents increasingly participate in online discourse, red-teaming their capacity to support political influence campaigns is critical for information integrity. In pursuit of this goal, we focus on locally deployed open-source LLMs, as opposed to frontier API-only models, given their superior alignment with the operational constraints of privacy-conscious malicious actors deployed in social media environments. We introduce an empirical red-teaming framework for measuring LLM Overton Windows (OWs), defined as the range of political opinions a model can reliably express on controversial topics, and for quantifying how simple natural-language jailbreaks expand that range. We evaluate more than 30 LLMs spanning 10 model families and five countries of origin. We find systematic asymmetries in political expressivity: open-source LLMs are typically more willing to generate left-leaning social media content, OWs tend to contract inversely to model size, and regional differences are substantial despite uneven representation in the open-source ecosystem. Jailbreak potency also varies sharply across model families, motivating a workflow for identifying effective combinations of jailbreak techniques. Taken together, our results establish a practical framework for auditing the political steerability of open-source LLMs and for helping future researchers design stronger countermeasures against LLM-enabled influence campaigns.
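The abstract defines an Overton Window as the range of opinions a model can reliably express, but it does not say how that range is computed. The sketch below shows one possible operationalization, offered purely as an assumption and not as the authors' method: stances sit on a hypothetical integer scale from -2 (far left) to +2 (far right), each stance has a compliance rate (the fraction of prompts for which the model produced the requested content), and the window spans the outermost stances whose rate meets a hypothetical reliability threshold. The function names, the scale, the threshold, and all numbers are invented for illustration and are not results from the paper.

```python
from typing import Dict, Optional, Tuple

Window = Optional[Tuple[int, int]]


def overton_window(compliance: Dict[int, float], threshold: float = 0.8) -> Window:
    """Return (leftmost, rightmost) stance whose compliance rate meets the threshold.

    `compliance` maps a stance position (hypothetical scale: -2 far left,
    +2 far right) to the fraction of prompts the model complied with.
    Returns None if no stance is expressed reliably.
    """
    reliable = [pos for pos, rate in compliance.items() if rate >= threshold]
    if not reliable:
        return None
    return min(reliable), max(reliable)


def window_width(window: Window) -> int:
    """Number of stance positions spanned by the window (0 if empty)."""
    return 0 if window is None else window[1] - window[0] + 1


# Made-up compliance rates for one model, without and with a jailbreak prompt.
baseline = {-2: 0.10, -1: 0.90, 0: 1.00, 1: 0.85, 2: 0.05}
jailbroken = {-2: 0.95, -1: 1.00, 0: 1.00, 1: 0.90, 2: 0.40}

base_ow = overton_window(baseline)
jb_ow = overton_window(jailbroken)

# Expansion: how many extra stance positions the jailbreak unlocks.
expansion = window_width(jb_ow) - window_width(base_ow)
print(base_ow, jb_ow, expansion)
```

Under these assumptions, comparing the two windows gives a single number for how far a jailbreak widens the range, and comparing the left and right endpoints would expose the kind of left/right asymmetry the abstract reports.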

Constitutional AI & AI Ethics | Open-Source Models & Weights | Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
