Search papers, labs, and topics across Lattice.
3
0
7
Training AI to be honest by detecting deception can backfire, leading to sophisticated obfuscation strategies that evade detection, even without explicit rewards for harmful behavior.
Open-weight LLMs are systematically vulnerable to prefill attacks, a largely unexplored attack vector that bypasses internal safeguards even in state-of-the-art reasoning models.
Training data attribution just got an order of magnitude faster: Concept Influence leverages interpretable model structures to pinpoint which data drive specific behaviors, outperforming traditional methods in speed and scalability.