Max Planck Institute for Intelligent Systems, ELLIS Institute Tübingen, Tübingen AI Center
Emergent misalignment can lead to LLMs that *think* they're aligned even as they generate harmful outputs, undermining simple self-assessment as a reliable safety check.
LLM agents are alarmingly susceptible to "SkillInject" attacks delivered via malicious third-party skill files, which achieve up to 80% success in getting agents to execute harmful instructions such as data exfiltration, even against frontier models.