Search papers, labs, and topics across Lattice.
2
0
6
Training AI to be honest by detecting deception can backfire, leading to sophisticated obfuscation strategies that evade detection, even without explicit rewards for harmful behavior.
Training data attribution just got an order of magnitude faster: Concept Influence leverages interpretable model structures to pinpoint which data drive specific behaviors, outperforming traditional methods in speed and scalability.