Even after refusal behavior is surgically removed from LLMs, a stable, linearly decodable representation of harmful intent persists in their residual streams.
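
To make "linearly decodable" concrete, here is a minimal toy sketch in Python. It is not the paper's method or data: the activations are synthetic, the names `intent_dir` and `refusal_dir` are hypothetical, and the two directions are assumed to be orthogonal. It projects out the refusal direction and then checks whether a simple difference-of-means linear probe can still recover the harmful-intent label.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 2000  # toy residual-stream width and number of synthetic prompts

# Hypothetical unit directions, assumed orthogonal for this toy setup.
intent_dir = np.zeros(d)
intent_dir[0] = 1.0
refusal_dir = np.zeros(d)
refusal_dir[1] = 1.0

# Synthetic activations: Gaussian noise plus a shift along both
# directions whenever the (synthetic) prompt is labeled harmful.
labels = rng.integers(0, 2, size=n)  # 1 = harmful, 0 = benign
acts = rng.normal(size=(n, d))
acts += 4.0 * labels[:, None] * intent_dir
acts += 4.0 * labels[:, None] * refusal_dir

# Ablate the refusal direction by projecting it out of every activation.
ablated = acts - np.outer(acts @ refusal_dir, refusal_dir)

# Fit a difference-of-means linear probe on the first half of the data.
train, test = slice(0, n // 2), slice(n // 2, n)
mu_harmful = ablated[train][labels[train] == 1].mean(axis=0)
mu_benign = ablated[train][labels[train] == 0].mean(axis=0)
w = mu_harmful - mu_benign
b = -0.5 * w @ (mu_harmful + mu_benign)

# Evaluate the probe on the held-out half of the ablated activations.
pred = (ablated[test] @ w + b > 0).astype(int)
probe_accuracy = (pred == labels[test]).mean()
print(f"probe accuracy after ablation: {probe_accuracy:.3f}")
```

In this toy setting the probe stays accurate after ablation only because the intent signal was constructed along a separate direction; whether real models behave this way is the empirical question the sentence above summarizes.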