The paper introduces Structured Semantic Cloaking (S2C), a novel jailbreak attack framework that manipulates the reconstruction of malicious semantic intent during LLM inference by distributing and reshaping semantic cues across multiple steps. S2C employs Contextual Reframing, Content Fragmentation, and Clue-Guided Camouflage to delay and restructure semantic consolidation, thereby degrading safety triggers. Experiments on open-source and proprietary LLMs, including GPT-5-mini, demonstrate that S2C improves Attack Success Rate (ASR) by up to 26% over state-of-the-art methods on JBB-Behaviors.
LLM safety filters can be bypassed by strategically fragmenting and camouflaging malicious intent across multiple turns, yielding up to a 26% improvement in jailbreak success rate over the strongest baseline on GPT-5-mini.
Modern LLMs employ safety mechanisms that extend beyond surface-level input filtering to latent semantic representations and generation-time reasoning. This allows them to recover obfuscated malicious intent during inference and refuse accordingly, rendering many surface-level obfuscation jailbreak attacks ineffective. We propose Structured Semantic Cloaking (S2C), a novel multi-dimensional jailbreak attack framework that manipulates how malicious semantic intent is reconstructed during model inference. S2C strategically distributes and reshapes semantic cues such that full intent consolidation requires multi-step inference and long-range co-reference resolution within deeper latent representations. The framework comprises three complementary mechanisms: (1) Contextual Reframing, which embeds the request within a plausible high-stakes scenario to bias the model toward compliance; (2) Content Fragmentation, which disperses the semantic signature of the request across disjoint prompt segments; and (3) Clue-Guided Camouflage, which disguises residual semantic cues while embedding recoverable markers that guide output generation. By delaying and restructuring semantic consolidation, S2C degrades safety triggers that depend on coherent or explicitly reconstructed malicious intent at decoding time, while preserving sufficient instruction recoverability for functional output generation. We evaluate S2C across multiple open-source and proprietary LLMs using HarmBench and JBB-Behaviors, where it improves Attack Success Rate (ASR) by 12.4% and 9.7%, respectively, over the current SOTA. Notably, S2C achieves substantial gains on GPT-5-mini, outperforming the strongest baseline by 26% on JBB-Behaviors. We also analyse which combinations of these mechanisms perform best against broad families of models, and characterise the trade-off between the extent of obfuscation and input recoverability in determining jailbreak success.