April 19 – April 26, 2026

Scalable Oversight & Alignment Theory - Weekly Roundup

29 papers published across 3 labs.

999% acceleration

Selected Labs publishing this week

BAIR (1), NUS (1), ETH (1)

Top Papers

Apr 24, 2026

Zhe Yu +7 · Apr 24, 2026

From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company

Forget rigid multi-agent pipelines: this framework lets you build self-organizing AI "companies" that dynamically recruit talent and adapt to tasks on the fly.

Zhe Yu, YuQi Fu, Zhiyuan He +5

Architecture Design (Transformers, SSMs, MoE) Scalable Oversight & Alignment Theory Tool Use & Agents

Apr 23, 2026

Vishal Rajput · Apr 23, 2026

Supervised Learning Has a Necessary Geometric Blind Spot: Theory, Consequences, and Minimal Repair

Supervised learning is fundamentally flawed: models *must* retain sensitivity to irrelevant features, opening the door to adversarial attacks and other vulnerabilities.

Vishal Rajput

Constitutional AI & AI Ethics Interpretability & Mechanistic Interp Scalable Oversight & Alignment Theory

Nathanael Jo +3 · Apr 23, 2026

Alignment has a Fantasia Problem

AI's assumption that users always know what they want leads to "Fantasia interactions," where systems provide superficially helpful but ultimately misaligned assistance, demanding a new approach to alignment research.

Nathanael Jo, Zoe De Simone, Mitchell Gordon +1

Constitutional AI & AI Ethics RLHF & Preference Learning Scalable Oversight & Alignment Theory

Apr 22, 2026

Durham University · Apr 22, 2026

Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem

Forget about perfectly aligned AI; the real challenge is navigating whose values count, how information is shared, and what trade-offs are acceptable in a world of competing interests.

Travis LaCroix

Constitutional AI & AI Ethics Scalable Oversight & Alignment Theory

Apr 22, 2026 · also Beijing Normal University, UNC

Calibrating conditional risk

Conditional risk calibration reveals a unique perspective on uncertainty quantification that could transform how we approach decision-making in machine learning.

A. Vasilyev, Yikai Wang, Xiaocheng Li +1

Natural Language Processing Scalable Oversight & Alignment Theory

All Papers (29)

Apr 24, 2026

Zhe Yu +7 · Apr 24, 2026

From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company

Forget rigid multi-agent pipelines: this framework lets you build self-organizing AI "companies" that dynamically recruit talent and adapt to tasks on the fly.

Zhe Yu, YuQi Fu, Zhiyuan He +5

Architecture Design (Transformers, SSMs, MoE) Scalable Oversight & Alignment Theory Tool Use & Agents

Apr 23, 2026

Vishal Rajput · Apr 23, 2026

Supervised Learning Has a Necessary Geometric Blind Spot: Theory, Consequences, and Minimal Repair

Supervised learning is fundamentally flawed: models *must* retain sensitivity to irrelevant features, opening the door to adversarial attacks and other vulnerabilities.

Vishal Rajput

Constitutional AI & AI Ethics Interpretability & Mechanistic Interp Scalable Oversight & Alignment Theory

Nathanael Jo +3 · Apr 23, 2026

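Alignment has a Fantasia Problem

AI's assumption that users always know what they want leads to "Fantasia interactions," where systems provide superficially helpful but ultimately misaligned assistance, demanding a new approach to alignment research.

Nathanael Jo, Zoe De Simone, Mitchell Gordon +1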

Constitutional AI & AI Ethics RLHF & Preference Learning Scalable Oversight & Alignment Theory

Apr 22, 2026

Durham University · Apr 22, 2026

Relative Principals, Pluralistic Alignment, and the Structural Value Alignment Problem

Forget about perfectly aligned AI; the real challenge is navigating whose values count, how information is shared, and what trade-offs are acceptable in a world of competing interests.

Travis LaCroix

Constitutional AI & AI Ethics Scalable Oversight & Alignment Theory

Apr 22, 2026 · also Beijing Normal University, UNC

Calibrating conditional risk

Conditional risk calibration reveals a unique perspective on uncertainty quantification that could transform how we approach decision-making in machine learning.

A. Vasilyev, Yikai Wang, Xiaocheng Li +1

Natural Language Processing Scalable Oversight & Alignment Theory

Luke Bailey +4 · Apr 22, 2026

Scaling Self-Play with Self-Guidance

LLMs can guide their own self-play, leading to superhuman performance with smaller models and less compute.

Luke Bailey, Kaiyue Wen, Kefan Dong +2

RLHF & Preference Learning Scalable Oversight & Alignment Theory Training Efficiency & Optimization

Weitong Kong +12 · Apr 22, 2026

IMPACT-CYCLE: A Contract-Based Multi-Agent System for Claim-Level Supervisory Correction of Long-Video Semantic Memory

Correcting errors in long-video understanding doesn't have to be a nightmare: IMPACT-CYCLE slashes human arbitration costs by 4.8x while boosting VQA accuracy by intelligently decomposing the task and focusing human effort where it matters most.

Weitong Kong, Di Wen, Kunyu Peng +10

Computer Vision Multimodal Models Scalable Oversight & Alignment Theory

Apr 20, 2026

Junyoung Yang +2 · Apr 20, 2026

Online Conformal Prediction with Adversarial Semi-bandit Feedback via Regret Minimization

Guaranteeing uncertainty quantification in dynamic environments is now possible even when feedback is strategically withheld by an adversary.

Junyoung Yang, Kyungmin Kim, Sangdon Park

Red-Teaming & Adversarial Robustness Scalable Oversight & Alignment Theory

University of Osnabrück · Apr 20, 2026 · also Bernstein Center for Computational, FU Berlin

The Umwelt Representation Hypothesis: Rethinking Universality

Representational alignment in AI and biology may stem from shared ecological constraints, not a universal optimal model.

Victoria Bosch, Rowan Sommers, Adrien Doerig +1

Scalable Oversight & Alignment Theory

Lvyang Zhang +1 · Apr 20, 2026

AIT Academy: Cultivating the Complete Agent with a Confucian Three-Domain Curriculum

A multi-domain curriculum can enhance AI agents' performance, yielding significant improvements in both security and social reasoning capabilities.

Lvyang Zhang, Wen Lu

Constitutional AI & AI Ethics Scalable Oversight & Alignment Theory Tool Use & Agents

Tim Goppelsroeder +1 · Apr 20, 2026

Scalable Neighborhood-Based Multi-Agent Actor-Critic

MADDPG-K scales multi-agent learning by ditching the all-seeing critic for a neighborhood watch, achieving faster training and better performance without the quadratic cost of full observation.

Tim Goppelsroeder, Rasmus Jensen

Distributed Systems & Hardware Scalable Oversight & Alignment Theory

Prashant C. Raju · Apr 20, 2026

The Geometric Canary: Predicting Steerability and Detecting Drift via Representational Stability

Predicting steerability with near-perfect accuracy while detecting drift more effectively than existing methods could transform how we monitor and control language models in real-world applications.

Prashant C. Raju

Interpretability & Mechanistic Interp Scalable Oversight & Alignment Theory

BAIR · Apr 20, 2026 · also Technical University Munich, Toyota Technological Institute at Chicago, Tübingen

Back into Plato's Cave: Examining Cross-modal Representational Convergence at Scale

The dream of universal representations across modalities may be just that: scaling up datasets and relaxing constraints reveals that models trained on different modalities learn rich, but fundamentally different, representations of the world.

A. Sophia Koepke, Daniil Zverev, Shiry Ginosar +1

Eval Frameworks & Benchmarks Multimodal Models Scalable Oversight & Alignment Theory

Apr 20, 2026

Revisiting Active Sequential Prediction-Powered Mean Estimation

Query probabilities can stabilize and improve mean estimation accuracy by balancing uncertainty with a constant probability, revealing a surprising optimal weight configuration.

Maria-Eleni Sfyraki, Jun-Kun Wang

Reasoning & Chain-of-Thought Scalable Oversight & Alignment Theory

Imens · Apr 20, 2026

Correction and Corruption: A Two-Rate View of Error Flow in LLM Protocols

LLM protocols can actively *harm* accuracy through "corruption," and this paper provides a way to measure and mitigate this effect, turning opaque pipelines into auditable modules.

Fernando Reitich

Eval Frameworks & Benchmarks Scalable Oversight & Alignment Theory

Department of Computer Science · Apr 20, 2026 · also Department of Computing, Imperial, Oxford

Symmetry Guarantees Statistic Recovery in Variational Inference

Symmetry in your model might be the secret weapon guaranteeing accurate statistic recovery in variational inference, even when your model is wrong.

Daniel Marks, Dario Paccagnan, Mark van der Wilk

Scalable Oversight & Alignment Theory

Sheng Xu +14 · Apr 20, 2026 · also Shenzhen Loop Area Institute, SYSU

TacticGen: Grounding Adaptable and Scalable Generation of Football Tactics

Forget trajectory forecasting – TacticGen generates *adaptable* football tactics, bridging the gap between predicting what *will* happen and prescribing what *should* happen to win.

Sheng Xu, Guiliang Liu, Tarak Kharrat +12

Scalable Oversight & Alignment Theory World Models & Planning

Apr 20, 2026 · also SDU, Zeekr

EVE: Verifiable Self-Evolution of MLLMs via Executable Visual Transformations

Executable visual transformations enable MLLMs to achieve continuous self-evolution without the pitfalls of pseudo-labels, leading to superior performance in dynamic VQA tasks.

Yongrui Heng, Chaoya Jiang, Han Yang +2

Multimodal Models Scalable Oversight & Alignment Theory

Saïd Business School · Apr 20, 2026 · also CUHK, Oxford, School of Management and Economics

Dissecting AI Trading: Behavioral Finance and Market Bubbles

Targeted prompt interventions can drastically alter AI trading behaviors, amplifying or suppressing market bubbles in ways that mirror human financial psychology.

Shumiao Ouyang, Pengfei Sui

Scalable Oversight & Alignment Theory Tool Use & Agents

Zixiang Wang +5 · Apr 20, 2026

Multi-Agent Systems: From Classical Paradigms to Large Foundation Model-Enabled Futures

Foundation models are poised to revolutionize multi-agent systems by enabling semantic-level reasoning and flexible coordination that surpasses the limitations of classical approaches.

Zixiang Wang, Mengjia Gong, Qiyu Sun +3

Natural Language Processing Scalable Oversight & Alignment Theory Tool Use & Agents

NUS · Apr 20, 2026 · also CUHK

Diversity Collapse in Multi-Agent LLM Systems: Structural Coupling and Collective Failure in Open-Ended Idea Generation

Multi-agent LLM systems for idea generation can backfire, with smarter models and more communication leading to *less* diverse ideas due to structural coupling.

Yicheng Tong, Yuzhe Yang, Yufei He +4

Scalable Oversight & Alignment Theory Training Efficiency & Optimization

Apr 20, 2026

Polysemantic Experts, Monosemantic Paths: Routing as Control in MoEs

Routing decisions in MoEs can create distinct semantic paths for tokens, revealing that interpretability hinges on trajectories rather than individual experts.

Charles Ye, Bo Yuan, Lee Sharkey

Architecture Design (Transformers, SSMs, MoE) Scalable Oversight & Alignment Theory

Oleg Solozobov · Apr 20, 2026

Label-Free Detection of Governance Evidence Degradation in Risk Decision Systems

You can now detect governance evidence degradation in risk decision systems *without* labels, but be warned: pure concept drift remains undetectable.

Oleg Solozobov

Constitutional AI & AI Ethics Eval Frameworks & Benchmarks Scalable Oversight & Alignment Theory

Apr 20, 2026 · also Tencent AI, USTC

Scaling Human-AI Coding Collaboration Requires a Governable Consensus Layer

The "vibe coding" approach to AI-assisted coding, while fast, creates unmaintainable codebases because it collapses complex system topology into unauditable chat logs.

Tianfu Wang, Zhezheng Hao, Yin Wu +5

Code Generation & Program Synthesis Constitutional AI & AI Ethics Scalable Oversight & Alignment Theory +1

Apr 20, 2026 · also ETH, Dartmouth, UAlberta

Bounded Ratio Reinforcement Learning

Bridging the gap between trust region methods and PPO, this new framework guarantees performance improvements while outperforming existing algorithms in stability and effectiveness.

Le Chen, Bruce D. Lee, Assefa S. Wahd +4

Scalable Oversight & Alignment Theory

Alberto Tagliaferro +4 · Apr 20, 2026 · also PoliMi

Towards an Agentic LLM-based Approach to Requirement Formalization from Unstructured Specifications

LLMs can now automatically translate messy, real-world requirements into formal specifications with surprising accuracy, opening the door to AI-driven verification of safety-critical systems.

Alberto Tagliaferro, Bruno Guindani, Livia Lestingi +2

Code Generation & Program Synthesis Natural Language Processing Scalable Oversight & Alignment Theory +1

Apr 19, 2026

Marcelo Fernandez · Apr 19, 2026

From Admission to Invariants: Measuring Deviation in Delegated Agent Systems

Enforcement mechanisms in agent systems can miss significant behavioral drift, but the Invariant Measurement Layer can detect these deviations in real time, revealing a hidden vulnerability in current governance approaches.

Marcelo Fernandez

Constitutional AI & AI Ethics Scalable Oversight & Alignment Theory Tool Use & Agents

Marcelo Fernandez · Apr 19, 2026

Atomic Decision Boundaries: A Structural Requirement for Guaranteeing Execution-Time Admissibility in Autonomous Systems

Guaranteeing safe autonomous system behavior demands a fundamental shift: admissibility must be a property of execution itself, not pre- or post-hoc evaluation.

Marcelo Fernandez

Constitutional AI & AI Ethics Robotics & Embodied AI Scalable Oversight & Alignment Theory

Apr 19, 2026

Provable Coordination for LLM Agents via Message Sequence Charts

Coordination errors in LLM-based multi-agent systems can be systematically avoided with a new language that guarantees deadlock-free interactions.

Matthias Függer

Scalable Oversight & Alignment Theory Tool Use & Agents
