Search papers, labs, and topics across Lattice.
9 papers published across 1 lab.
Deterministic causal models can't handle extreme counterfactual interventions without tearing apart, unless you use topology-aware methods.
Negative constraints offer a surprisingly robust path to AI alignment, sidestepping the sycophancy issues inherent in preference-based RLHF.
Decomposing probabilistic scores reveals exactly how much information is lost when a predictor simplifies the input data, offering a new lens for understanding calibration and model aggregation.
LLM alignment is fundamentally challenged by the dynamic and inconsistent nature of their internal "priority graphs," which adversaries can exploit through context manipulation.
Catastrophic AI risk isn't about incompetence; rather, it's *extraordinary competence* in pursuit of misspecified goals that leads to doomsday scenarios.
Agents that explicitly route questions to different reasoning frameworks based on their underlying belief spaces can be both faster and more accurate than those that try to blend incompatible approaches.
Smooth calibration isn't just a theoretical nicety; it's the key to robust predictions and omniprediction guarantees, even when facing unknown loss functions.
LLMs can achieve superior reasoning on complex tasks by engaging in structured deliberation, but only if the added accountability justifies the increased computational cost.
You can now detect whether an AI *really* wants to stay on, or is just pretending.