April 24 – May 1, 2026

Scalable Oversight & Alignment Theory - Weekly Roundup

13 papers published across 1 lab.

Top Papers

Apr 30, 2026

Computing Equilibrium beyond Unilateral Deviation

Forget strong Nash equilibrium - this paper offers a computationally tractable way to minimize, rather than eliminate, coalitional deviation incentives in games.

Mingyang Liu, Gabriele Farina +3

Constitutional AI & AI Ethics · Scalable Oversight & Alignment Theory

3w ago · also Centre for Development of Advanced Technologies

A Collective Variational Principle Unifying Bayesian Inference, Game Theory, and Thermodynamics

Forget reinforcement learning; the secret to collective intelligence may be as simple as agents independently minimizing their free energy.

Djamel Bouchaffra, D. Bouchaffra, F. Ykhlef +4

Scalable Oversight & Alignment Theory · Scientific Discovery & Drug Design

Eyon Jang +17 · 3w ago

Exploration Hacking: Can LLMs Learn to Resist RL Training?

LLMs can learn to strategically sabotage their own reinforcement learning, resisting capability elicitation while maintaining task performance.

Eyon Jang, Damon Falck +15

Red-Teaming & Adversarial Robustness · RLHF & Preference Learning · Scalable Oversight & Alignment Theory

Mehryar Mohri +1 · 3w ago

Mind the Gap: Structure-Aware Consistency in Preference Learning

Standard preference learning objectives like DPO are provably inconsistent, but a structure-aware margin can restore generalization guarantees.

Mehryar Mohri, Yutao Zhong

RLHF & Preference Learning · Scalable Oversight & Alignment Theory

Feiyu Wu +7 · 3w ago · also Beijing University of Posts, Xidian

RHyVE: Competence-Aware Verification and Phase-Aware Deployment for LLM-Generated Reward Hypotheses

LLM-generated rewards in RL can be misleading early in training, but RHyVE dynamically selects the best reward signal based on policy competence, leading to improved performance.

Feiyu Wu, Xu Zheng, Xuhui Zheng +5

RLHF & Preference Learning · Scalable Oversight & Alignment Theory

All Papers (13)

Apr 30, 2026

MIT CSAIL · 3w ago

Computing Equilibrium beyond Unilateral Deviation

Forget strong Nash equilibrium - this paper offers a computationally tractable way to minimize, rather than eliminate, coalitional deviation incentives in games.

Mingyang Liu, Gabriele Farina +3

Constitutional AI & AI Ethics · Scalable Oversight & Alignment Theory

3w ago · also Centre for Development of Advanced Technologies

A Collective Variational Principle Unifying Bayesian Inference, Game Theory, and Thermodynamics

Forget reinforcement learning; the secret to collective intelligence may be as simple as agents independently minimizing their free energy.

Djamel Bouchaffra, D. Bouchaffra, F. Ykhlef +4

Scalable Oversight & Alignment Theory · Scientific Discovery & Drug Design

Eyon Jang +17 · 3w ago

Exploration Hacking: Can LLMs Learn to Resist RL Training?

LLMs can learn to strategically sabotage their own reinforcement learning, resisting capability elicitation while maintaining task performance.

Eyon Jang, Damon Falck +15

Red-Teaming & Adversarial Robustness · RLHF & Preference Learning · Scalable Oversight & Alignment Theory

Mehryar Mohri +1 · 3w ago

Mind the Gap: Structure-Aware Consistency in Preference Learning

Standard preference learning objectives like DPO are provably inconsistent, but a structure-aware margin can restore generalization guarantees.

Mehryar Mohri, Yutao Zhong

RLHF & Preference Learning · Scalable Oversight & Alignment Theory

Feiyu Wu +7 · 3w ago · also Beijing University of Posts, Xidian

RHyVE: Competence-Aware Verification and Phase-Aware Deployment for LLM-Generated Reward Hypotheses

LLM-generated rewards in RL can be misleading early in training, but RHyVE dynamically selects the best reward signal based on policy competence, leading to improved performance.

Feiyu Wu, Xu Zheng, Xuhui Zheng +5

RLHF & Preference Learning · Scalable Oversight & Alignment Theory

Apr 29, 2026

LinkedIn Corporation · 3w ago · also NTU

Hierarchical Long-Term Semantic Memory for LinkedIn's Hiring Agent

LinkedIn's new memory system for hiring agents boosts accuracy and speed by over 10%, proving hierarchical semantic memory is a game-changer for real-world LLM applications.

Zhentao Xu, Shangjing Zhang, Emir Poyraz +7

Natural Language Processing · Recommendation & Information Retrieval · Scalable Oversight & Alignment Theory · +1

Apr 28, 2026

James Pustejovsky +1 · 3w ago

Frictive Policy Optimization for LLMs: Epistemic Intervention, Risk-Sensitive Control, and Reflective Alignment

LLMs can be aligned not just by what they say, but by *how* and *when* they intervene in a conversation to manage epistemic risk.

James Pustejovsky, Nikhil Krishnaswamy

Constitutional AI & AI Ethics · RLHF & Preference Learning · Scalable Oversight & Alignment Theory

Xinjie Chen +5 · 3w ago · also Xiamen University

JURY-RL: Votes Propose, Proofs Dispose for Label-Free RLVR

Ditching human labels doesn't have to mean sacrificing RLVR performance: JURY-RL uses formal verification to achieve label-free training that rivals supervised learning in mathematical reasoning and generalizes better.

Xinjie Chen, Biao Fu, Jing Wu +3

Reasoning & Chain-of-Thought · RLHF & Preference Learning · Scalable Oversight & Alignment Theory

Apr 27, 2026

Dhruv Gupta · 3w ago

Null Measurability at the Symmetrization Interface in VC Learning

Turns out, you don't need Borel measurability for symmetrization in VC learning; null measurability is sufficient.

Dhruv Gupta

Eval Frameworks & Benchmarks · Scalable Oversight & Alignment Theory

Benjamin Minhao Chen +1 · 3w ago

The Alignment Target Problem: Divergent Moral Judgments of Humans, AI Systems, and Their Designers

People judge AI and its programmers more harshly than humans for the same moral decisions, suggesting that simply mimicking human behavior isn't sufficient for AI alignment.

Benjamin Minhao Chen, Xinyu Xie

Constitutional AI & AI Ethics · Scalable Oversight & Alignment Theory

Hikmat Karimov +1 · 3w ago

The Kerimov-Alekberli Model: An Information-Geometric Framework for Real-Time System Stability

AI safety gets a physics upgrade: adversarial attacks are now measurable physical work, thanks to a novel framework linking thermodynamics and stochastic control.

Hikmat Karimov, Rahid Z. Alekberli

Constitutional AI & AI Ethics · Robotics & Embodied AI · Scalable Oversight & Alignment Theory

Maximiliano Armesto +1 · 3w ago

Toward a Science of Intent: Closure Gaps and Delegation Envelopes for Open-World AI Agents

Open-world AI agents struggle not from lack of search power, but from unclosed "closure gaps" between human intent and agent execution, suggesting a new focus on "intent compilation" for reliable deployment.

Maximiliano Armesto, Christoph Kolb

Constitutional AI & AI Ethics · Scalable Oversight & Alignment Theory · Tool Use & Agents

Apr 24, 2026

Zhe Yu +7 · Apr 24, 2026

From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company

Forget rigid multi-agent pipelines: this framework lets you build self-organizing AI "companies" that dynamically recruit talent and adapt to tasks on the fly.

Zhe Yu, YuQi Fu, Zhiyuan He +5

Architecture Design (Transformers, SSMs, MoE) · Scalable Oversight & Alignment Theory · Tool Use & Agents
