Microsoft Research | JHU | PKU | University of California | Apr 30, 2026 | arXiv:2604.27861

TwinGate: Stateful Defense against Decompositional Jailbreaks in Untraceable Traffic via Asymmetric Contrastive Learning

Bowen Sun, Chaozhuo Li, Yaodong Yang, Yiwei Wang, Chaowei Xiao

AI Summary

TwinGate, a stateful defense framework, addresses decompositional jailbreaks by using Asymmetric Contrastive Learning (ACL) to cluster malicious fragments in a shared latent space while suppressing false positives with a frozen encoder. This approach enables tracking of malicious intents across anonymized and interleaved requests without relying on user metadata or computationally expensive generative models. Evaluated on a dataset of 3.62 million instructions, TwinGate achieves high malicious intent recall with a low false positive rate, outperforming stateful and stateless baselines in throughput and latency.

Key Contribution

LLM defenses can track decompositional jailbreaks whose fragments are spread across fully anonymized, interleaved traffic, at negligible latency overhead, using an asymmetric contrastive learning approach.

Abstract

Decompositional jailbreaks pose a critical threat to large language models (LLMs) by allowing adversaries to fragment a malicious objective into a sequence of individually benign queries that collectively reconstruct prohibited content. In real-world deployments, LLMs face a continuous, untraceable stream of fully anonymized and arbitrarily interleaved requests, infiltrated by covertly distributed adversarial queries. Under this rigorous threat model, state-of-the-art defensive strategies exhibit fundamental limitations. In the absence of trustworthy user metadata, they are incapable of tracking global historical contexts, while their deployment of generative models for real-time monitoring introduces computationally prohibitive overhead. To address this, we present TwinGate, a stateful dual-encoder defense framework. TwinGate employs Asymmetric Contrastive Learning (ACL) to cluster semantically disparate but intent-matched malicious fragments in a shared latent space, while a parallel frozen encoder suppresses false positives arising from benign topical overlap. Each request requires only a single lightweight forward pass, enabling the defense to execute in parallel with the target model's prefill phase at negligible latency overhead. To evaluate our approach and advance future research, we construct a comprehensive dataset of over 3.62 million instructions spanning 8,600 distinct malicious intents. Evaluated on this large-scale corpus under a strictly causal protocol, TwinGate achieves high malicious intent recall at a remarkably low false positive rate while remaining highly robust against adaptive attacks. Furthermore, our proposal substantially outperforms stateful and stateless baselines, delivering superior throughput and reduced latency.
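To make the stateful, dual-encoder idea in the abstract concrete, here is a minimal sketch of how such a gate could be wired. It is an illustration under stated assumptions, not the authors' implementation: the class name `StatefulGate`, the scoring rule (learned-space similarity minus frozen-space similarity, maximized over history), the threshold, and the sliding window are all hypothetical. The embeddings are assumed to be precomputed by two encoders, one trained with ACL and one frozen, which are not shown.

```python
import math
from collections import deque


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class StatefulGate:
    """Hypothetical gate: flags a request whose learned-space embedding matches
    earlier traffic much more closely than its frozen-space embedding does."""

    def __init__(self, threshold=0.5, window=1000):
        self.threshold = threshold            # assumed decision margin
        self.history = deque(maxlen=window)   # shared memory, no user IDs

    def score(self, intent_vec, topic_vec):
        # Assumed rule: best (learned similarity - frozen similarity) over history.
        # High learned similarity suggests a shared intent; high frozen similarity
        # suggests plain topical overlap, so it is subtracted.
        return max(
            (cosine(intent_vec, hi) - cosine(topic_vec, ht)
             for hi, ht in self.history),
            default=0.0,
        )

    def check(self, intent_vec, topic_vec):
        """Score against past requests only (causal), then store the request."""
        s = self.score(intent_vec, topic_vec)
        self.history.append((intent_vec, topic_vec))
        return s > self.threshold


gate = StatefulGate(threshold=0.5)
first = gate.check([1.0, 0.0], [1.0, 0.0])    # empty history
second = gate.check([1.0, 0.0], [0.0, 1.0])   # same intent vector, unrelated topic vector
third = gate.check([0.0, 1.0], [1.0, 0.0])    # different intent vector, overlapping topic
```

In this toy run only the second request is flagged: it matches the first in the learned space while differing in the frozen space. The third request overlaps with earlier traffic only in the frozen (topical) space, so its score stays at or below zero. Because the history is keyed by embeddings rather than by user, the gate needs no metadata, which is the property the abstract emphasizes.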

Architecture Design (Transformers, SSMs, MoE) | Natural Language Processing | Red-Teaming & Adversarial Robustness

Citation Metrics

Citations: 0
Influential citations: 0
References: 0
Year: 2026
Venue: N/A
