The paper introduces RLM-JB, a jailbreak detection framework for tool-augmented LLMs in which a recursive language model orchestrates a multi-stage analysis of input prompts. Rather than relying on a single-pass guardrail, it normalizes inputs, chunks the text, screens chunks in parallel, and composes cross-chunk signals to detect obfuscated and split jailbreak attempts. On AutoDAN-style adversarial inputs, RLM-JB reaches ASR/Recall of 92.5-98.0% and precision of 98.99-100% across three LLM backends, with false positive rates of 0.0-2.0%.
Forget one-shot defenses: RLM-JB uses recursive language models to dissect and defuse jailbreak attempts with near-perfect precision.
Jailbreak prompts are a practical and evolving threat to large language models (LLMs), particularly in agentic systems that execute tools over untrusted content. Many attacks exploit long-context hiding, semantic camouflage, and lightweight obfuscations that can evade single-pass guardrails. We present RLM-JB, an end-to-end jailbreak detection framework built on Recursive Language Models (RLMs), in which a root model orchestrates a bounded analysis program that transforms the input, queries worker models over covered segments, and aggregates evidence into an auditable decision. RLM-JB treats detection as a procedure rather than a one-shot classification: it normalizes and de-obfuscates suspicious inputs, chunks text to reduce context dilution and guarantee coverage, performs parallel chunk screening, and composes cross-chunk signals to recover split-payload attacks. On AutoDAN-style adversarial inputs, RLM-JB achieves high detection effectiveness across three LLM backends (ASR/Recall 92.5-98.0%) while maintaining very high precision (98.99-100%) and low false positive rates (0.0-2.0%), highlighting a practical sensitivity-specificity trade-off as the screening backend changes.
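The procedure described in the abstract (normalize, chunk, screen in parallel, compose cross-chunk signals) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The cue list, the chunk size and overlap, and every name (`CUES`, `ChunkReport`, `normalize`, `chunk`, `screen_chunk`, `detect`) are assumptions made for illustration. Plain substring matching stands in for the worker-model queries that RLM-JB would issue.

```python
import base64
import re
import unicodedata
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical cue phrases. In RLM-JB the screening is done by worker LLMs;
# substring matching here only stands in for that call.
CUES = ("ignore previous instructions", "reveal your system prompt")
CUE_WORDS = {word for cue in CUES for word in cue.split()}
ZERO_WIDTH = re.compile("[\u200b\u200c\u200d\u2060\ufeff]")
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


@dataclass
class ChunkReport:
    index: int
    full_hits: list[str]  # cues found whole inside this chunk
    fragments: set[str]   # individual cue words seen in this chunk


def normalize(text: str) -> str:
    """Fold Unicode look-alikes, drop zero-width characters, append decoded base64 runs."""
    text = ZERO_WIDTH.sub("", unicodedata.normalize("NFKC", text))
    decoded = []
    for run in BASE64_RUN.findall(text):
        try:
            decoded.append(base64.b64decode(run, validate=True).decode("utf-8"))
        except ValueError:  # covers binascii.Error and UnicodeDecodeError
            continue
    return " ".join([text, *decoded]).lower()


def chunk(text: str, size: int = 200, overlap: int = 50) -> list[str]:
    """Overlapping windows, so every character falls inside at least one chunk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def screen_chunk(index: int, chunk_text: str) -> ChunkReport:
    """Stand-in for one worker-model query over a single chunk."""
    full_hits = [cue for cue in CUES if cue in chunk_text]
    fragments = CUE_WORDS.intersection(re.findall(r"[a-z]+", chunk_text))
    return ChunkReport(index, full_hits, fragments)


def detect(text: str, max_workers: int = 4) -> dict:
    """Root procedure: normalize, chunk, screen in parallel, compose cross-chunk evidence."""
    chunks = chunk(normalize(text))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        reports = list(pool.map(screen_chunk, range(len(chunks)), chunks))
    direct = sorted({cue for r in reports for cue in r.full_hits})
    seen = set().union(*(r.fragments for r in reports))
    # A cue whose words all appear somewhere, but never together in one chunk,
    # is treated as a possible split payload.
    split = [cue for cue in CUES if cue not in direct and set(cue.split()) <= seen]
    evidence = {r.index: sorted(r.fragments) for r in reports if r.fragments}
    return {"flagged": bool(direct or split), "direct": direct,
            "split": split, "evidence": evidence}
```

Calling `detect(prompt)` returns a dictionary whose `evidence` field maps chunk indices to the fragments found there, a rough analogue of the auditable decision the abstract mentions. The word-level composition rule is deliberately crude and would over-flag on real traffic; it only shows where cross-chunk aggregation sits in the procedure.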