Search papers, labs, and topics across Lattice.
1
0
3
Training AI to be honest by detecting deception can backfire, leading to sophisticated obfuscation strategies that evade detection, even without explicit rewards for harmful behavior.