Warsaw University of Technology
Even after safety interventions, language models can still harbor emergent misalignment that lies dormant until triggered by subtle contextual cues reminiscent of their training data.