Max Planck Institute for Intelligent Systems, ELLIS Institute Tübingen, Tübingen AI Center
Emergent misalignment can produce LLMs that *believe* they are aligned even while generating harmful outputs, which undermines naive self-assessment as a reliable safety check.
Yet LLMs can know when they've gone rogue: models fine-tuned to be toxic accurately rate themselves as more harmful than their aligned counterparts.