Radboud University
Forget retraining: NeWTral instantly restores safety to your LLM after adding a risky LoRA, slashing attack success rates from 70% to 13% without sacrificing expertise.
Control knobs for LLM safety exist: MASCing lets you steer MoE behavior *without* costly retraining, improving jailbreak defense by up to 89.2% and control over adult-content generation by up to 93.0%.
Evolutionary algorithms can evolve monotone Boolean functions that achieve nonlinearities surpassing traditional majority functions, challenging existing limits in this domain.
Even a single compromised pipeline stage can inject backdoors that drastically misalign LLMs, bypassing standard safety alignment.
Backdoor defenses focused on removing training triggers are fundamentally flawed, as alternative, perceptually distinct triggers can reliably activate the same backdoor via a latent feature-space direction.