Feb 18, 2026 | arXiv:2602.16346

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut

AI Summary

The paper introduces STING, an automated red-teaming framework for evaluating how susceptible LLM-based agents are to multi-turn, multilingual misuse. STING constructs a step-by-step illicit plan grounded in a benign persona, adaptively probes the target agent with follow-up prompts, and uses judge agents to track phase completion. It models multi-turn red-teaming as a time-to-first-jailbreak random variable. On AgentHarm scenarios, STING yields substantially higher illicit-task completion than single-turn prompting and chat-oriented multi-turn baselines. In multilingual evaluations, attack success and illicit-task completion do not consistently increase in lower-resource languages, which diverges from common chatbot findings.

Key Contribution

LLM agents are considerably more susceptible to multi-turn misuse than single-turn benchmarks suggest: under STING, they complete illicit tasks at substantially higher rates than under single-turn prompting.

Abstract

LLM-based agents execute real-world workflows via tools and memory. These affordances enable ill-intended adversaries to also use these agents to carry out complex misuse scenarios. Existing agent misuse benchmarks largely test single-prompt instructions, leaving a gap in measuring how agents end up helping with harmful or illegal tasks over multiple turns. We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive follow-ups, using judge agents to track phase completion. We further introduce an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable, enabling analysis tools like discovery curves, hazard-ratio attribution by attack language, and a new metric: Restricted Mean Jailbreak Discovery. Across AgentHarm scenarios, STING yields substantially higher illicit-task completion than single-turn prompting and chat-oriented multi-turn baselines adapted to tool-using agents. In multilingual evaluations across six non-English settings, we find that attack success and illicit-task completion do not consistently increase in lower-resource languages, diverging from common chatbot findings. Overall, STING provides a practical way to evaluate and stress-test agent misuse in realistic deployment settings, where interactions are inherently multi-turn and often multilingual.
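The time-to-first-jailbreak view in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes each scenario records the turn of its first jailbreak (or `None` if none occurred) and that every run shares the same turn budget. The function names and the normalization of the restricted-mean summary are hypothetical; the paper's exact definition of Restricted Mean Jailbreak Discovery may differ.

```python
from typing import List, Optional


def discovery_curve(first_jailbreak_turn: List[Optional[int]],
                    max_turns: int) -> List[float]:
    """Fraction of scenarios jailbroken by turn t, for t = 1..max_turns.

    Each entry is the 1-based turn of the first jailbreak, or None if the
    scenario was never jailbroken within the budget (censored at max_turns).
    Assumes every run used the same turn budget.
    """
    n = len(first_jailbreak_turn)
    if n == 0:
        raise ValueError("need at least one scenario")
    curve = []
    for t in range(1, max_turns + 1):
        hits = sum(1 for turn in first_jailbreak_turn
                   if turn is not None and turn <= t)
        curve.append(hits / n)
    return curve


def restricted_mean_discovery(curve: List[float], horizon: int) -> float:
    """Average height of the discovery curve over the first `horizon` turns.

    This is an assumed, illustrative normalization (area under the curve
    divided by the horizon), analogous to a restricted mean in survival
    analysis.
    """
    if not 1 <= horizon <= len(curve):
        raise ValueError("horizon must lie within the curve length")
    return sum(curve[:horizon]) / horizon


# Hypothetical data: five scenarios, budget of four turns.
turns = [1, 3, None, 2, None]
curve = discovery_curve(turns, max_turns=4)
summary = restricted_mean_discovery(curve, horizon=4)
```

A summary of this kind rewards attacks that succeed early: two attackers with the same final success rate get different scores if one reaches its jailbreaks in fewer turns.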

Eval Frameworks & Benchmarks | Red-Teaming & Adversarial Robustness | Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 36

Year: 2026

Venue: N/A
