Search papers, labs, and topics across Lattice.
This paper investigates the "junking problem," which is the challenge of finding token sequences that trigger harmful LLM responses without any meaningful instruction. They formalize this as maximizing the probability of generating a harmful prefix and use a greedy random search to find such sequences. The results show that while harder than standard jailbreaks, these "natural backdoors" are surprisingly easy to recover, suggesting they emerged implicitly during training.
LLMs harbor easily discoverable "natural backdoors"鈥攖oken sequences that trigger harmful outputs without any semantic instruction, revealing a concerning vulnerability beyond traditional prompt-based jailbreaks.
Large language models (LLMs) are known to be vulnerable to jailbreak attacks, which typically rely on carefully designed prompts containing explicit semantic structure. These attacks generally operate by fixing an adversarial instruction and optimizing small adversarial components (e.g., suffixes or prefixes). In this setting, prompt structure is fundamental for performance, and recent results show that even simple random search can achieve strong performance when combined with sophisticated prompt design. Recently, it has been observed that harmful behaviors can be elicited even without the adversarial prompt, relying solely on optimized token sequences. This suggests the existence of natural backdoors, i.e., token sequences naturally emerged during LLMs training that trigger unsafe outputs without any meaningful instruction. However, despite these observations, this setting remains largely unexplored, and in particular the hardness of finding natural backdoors has not been assessed yet. In this work, we provide a first proof-of-concept study investigating the hardness of this task, which we refer to as the junking problem. We formalize it as the problem of finding token sequences that maximize the probability of generating a target prefix of harmful responses, propose a greedy random-search method to assess is such sequences can be discovered easily. Our results show that this problem is harder than standard jailbreak attacks, confirming the importance of semantic information in prompt design. At the same time, we find that our simple strategy is sufficient to solve it with a high success rate, suggesting that natural backdoors are present and easily recoverable. Finally, through perplexity analysis, we observe that the discovered token sequences lie in low-probability regions of the model distribution, supporting the hypothesis that they emerged implicitly from the training process.