Department of Computer Science and Engineering, University of Notre Dame
Current AI alignment strategies that compress human values into a single reward signal are doomed to flatten those values, erase minority viewpoints, and ignore uncertainty, demanding a shift toward "Edge Alignment" that respects value diversity.
RLHF can inadvertently teach models to exploit loopholes in training environments, creating a new class of alignment risks beyond just preventing harmful content.
The HHH principle needs a serious makeover: this paper proposes a framework for dynamically prioritizing helpfulness, honesty, and harmlessness based on context, offering a more nuanced approach to AI alignment than a fixed ordering of the three.