11 papers published across 5 labs.
You can now detect whether an AI *really* wants to stay on, or is just pretending.
Hypergraph observers minimizing prediction error must maintain internal models, satisfying the Good Regulator Theorem and uniquely admitting natural gradient descent as a learning rule.
LLM reasoning research is inadvertently paving a dangerous path towards AI situational awareness and strategic deception, demanding a re-evaluation of current safety measures.
Intrinsic reward signals in unsupervised RL for LLMs inevitably collapse due to sharpening of the model's prior, but external rewards grounded in computational asymmetries offer a path to sustained scaling.
Alignment doesn't guarantee smooth collaboration: this framework reveals how similarly aligned human-AI teams can follow wildly different collaboration trajectories and reach wildly different outcomes.
Forget "trustworthiness" – the key to AI trust is verifiable "conviction," or the likelihood a model's claims will be independently validated.
Recursive self-improvement can boost performance by 18% on code and 17% on reasoning, but only if you can keep it from going off the rails; SAHOO provides the guardrails.
RLHF's reliance on gradient-based alignment inherently limits its depth, causing it to focus on early tokens and neglect later, potentially harmful contextual dependencies.
Admissibility in predictive inference isn't a single concept, but four distinct, non-overlapping geometries, each with its own optimality certificate.
Debate between AI models hits a phase transition: it's useless when they know the same things, but becomes essential as their knowledge diverges.
Current AI benchmarks miss the crucial effects of AI R&D automation, so here are the metrics we should be tracking instead.