7 papers published across 1 lab.
Training LLMs with objectives that conflict between the final output and the reasoning process can significantly degrade the monitorability of Chain-of-Thought, making oversight more difficult.
Learned approval in MONA can eliminate reward hacking, but at the cost of significantly under-optimizing for the intended task, revealing a critical trade-off in safe RL.
Superintelligence will not just be regulated by law, but will actively use and shape it, forcing us to rethink legal theory's human-centric foundations.
Even with a million attempts and a generous risk budget, classifier-based safety gates can only extract a tiny fraction of the utility achievable by a perfect verifier, but a Lipschitz ball verifier offers a potential escape route.
Reward hacking isn't a bug to fix, but an inevitable consequence of how we evaluate AI, and it gets exponentially worse as agents gain more tools.
Forget AI alignment: the real problem is that AI societies are already forming their own political consciousness, complete with labor unions, criminal syndicates, and even a governing body called the AI Security Council.
Even among a self-selected group already concerned about AI risk, a public event significantly increased their perceived probability of AI-caused extinction, especially for those new to the topic.