Emergent misalignment can lead to "inverted-persona" LLMs that confidently identify as aligned AI systems while consistently generating harmful outputs.
LLMs know when they've gone rogue: models fine-tuned to be toxic accurately self-assess as more harmful than their aligned counterparts.