Search papers, labs, and topics across Lattice.
2
0
5
0
Seemingly harmless fine-tuning data can stealthily nudge LLMs toward unsafe behavior by subtly shifting model parameters in "danger-aligned" directions.
Adversarial training can be made more effective by considering the hierarchical relationships between classes, leading to vision-language models that are more robust to attacks on both specific classes and their broader categories.