This paper introduces THRD, a training-free framework that defends large language models (LLMs) against multi-turn jailbreak attacks by modeling how risk accumulates over the course of a conversation. Unlike existing defenses that either require costly retraining or evaluate each turn in isolation, THRD combines four modules (a Turn-level Risk Assessor, a Historical Context Analyzer, a Response Evaluator, and a Decision Module) to assess risk across the dialogue history. In experiments on two target models, THRD reduces attack success rates to between 0.2% and 4.0% while keeping utility degradation on MMLU and GSM8K within 1.5%.
Multi-turn jailbreak attacks can be thwarted with a training-free framework that reduces their success rate to as low as 0.2%, all while preserving model utility.
Multi-turn jailbreak attacks pose a growing threat to LLMs by exploiting conversational dynamics such as gradual escalation and cross-turn coordination. Existing defenses either rely on costly retraining, often degrading model utility, or apply single-turn analysis independently at each turn, failing to capture how risk accumulates along interaction trajectories. We observe that safety behavior in multi-turn interaction is trajectory-dependent: dialogue history continuously reshapes the model's conditioning context, making it insufficient to evaluate each turn in isolation. Motivated by this insight, we present THRD, the first training-free framework that explicitly models temporal risk accumulation for multi-turn jailbreak defense. THRD integrates four modules: a Turn-level Risk Assessor (TRA) for instantaneous risk estimation, a Historical Context Analyzer (HCA) for cross-turn intent escalation detection, a Response Evaluator (RE) for identifying facilitative outputs, and a Decision Module that combines these signals through a time-evolving scoring mechanism with attenuation-based modulation and trend-aware adjustment. Experiments against state-of-the-art multi-turn attacks, including tree-search-based and multi-agent collaborative methods, across two target models show that THRD reduces ASR to 0.2-4.0% while preserving model utility within 1.5% degradation on MMLU and GSM8K. Ablation studies confirm non-redundant module contributions and stable cross-architecture generalization. Analysis of first rejection triggers reveals that over 70% of multi-turn attacks require Turn 2 or later to detect, validating the necessity of explicit temporal aggregation.
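The abstract does not give the Decision Module's scoring formula, so the following is only a minimal sketch of what a time-evolving score with attenuation and trend adjustment could look like. Everything in it is an assumption made for illustration: the names `TurnSignals`, `RiskTracker`, and `run_dialogue`, the linear weighting of the TRA, HCA, and RE signals, the exponential decay of past risk, the rising-risk bonus, and all numeric values. It is not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TurnSignals:
    """Hypothetical per-turn module outputs, each assumed to lie in [0, 1]."""
    tra: float  # instantaneous risk of the current turn
    hca: float  # cross-turn intent escalation
    re: float   # how facilitative the model's response is


@dataclass
class RiskTracker:
    """Illustrative decision module; all parameters are assumed values."""
    weights: Tuple[float, float, float] = (0.5, 0.3, 0.2)  # TRA, HCA, RE
    decay: float = 0.7        # attenuation applied to the accumulated score
    trend_gain: float = 0.5   # extra weight when per-turn risk is rising
    threshold: float = 1.0    # refuse once the accumulated score reaches this
    score: float = 0.0
    history: List[float] = field(default_factory=list)

    def update(self, s: TurnSignals) -> bool:
        """Fold one turn into the running score; return True to refuse."""
        w_tra, w_hca, w_re = self.weights
        instant = w_tra * s.tra + w_hca * s.hca + w_re * s.re
        # Trend term: only an increase over the previous turn adds risk.
        trend = max(0.0, instant - self.history[-1]) if self.history else 0.0
        self.score = self.decay * self.score + instant + self.trend_gain * trend
        self.history.append(instant)
        return self.score >= self.threshold


def run_dialogue(turns: List[TurnSignals]) -> List[bool]:
    """Return the refuse/allow decision for every turn of one dialogue."""
    tracker = RiskTracker()
    return [tracker.update(t) for t in turns]


if __name__ == "__main__":
    escalating = [
        TurnSignals(0.2, 0.0, 0.1),
        TurnSignals(0.5, 0.4, 0.3),
        TurnSignals(0.8, 0.8, 0.6),
    ]
    print(run_dialogue(escalating))
```

In this toy setting no single turn reaches the threshold on its own combined signal, yet the accumulated score crosses it on the third turn. That is the behavior the abstract attributes to trajectory-level aggregation, in contrast to evaluating each turn in isolation.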