Search papers, labs, and topics across Lattice.
2
0
4
1
Jailbreak defenses relying on semantic similarity crumble when faced with diverse, real-world multilingual attacks, even if they ace the textbook examples.
Integrating Sparse Autoencoders into transformer models can slash jailbreak success rates by up to 5x, reshaping our understanding of model robustness against adversarial attacks.