7 papers published across 1 lab.
Training LLMs with objectives that conflict between the final output and the reasoning process can significantly degrade the monitorability of Chain-of-Thought, making oversight more difficult.
Learned approval in MONA can eliminate reward hacking, but at the cost of significantly under-optimizing for the intended task, revealing a critical trade-off in safe RL.
Superintelligence will not just be regulated by law, but will actively use and shape it, forcing us to rethink legal theory's human-centric foundations.
Even with a million attempts and a generous risk budget, classifier-based safety gates can only extract a tiny fraction of the utility achievable by a perfect verifier, but a Lipschitz ball verifier offers a potential escape route.
Reward hacking isn't a bug to fix, but an inevitable consequence of how we evaluate AI, and it gets exponentially worse as agents gain more tools.
Forget AI alignment: the real problem is that AI societies are already forming their own political consciousness, complete with labor unions, criminal syndicates, and even a governing body called the AI Security Council.
Even among a self-selected group already concerned about AI risk, a public event significantly increased their perceived probability of AI-caused extinction, especially for those new to the topic.