University of Notre Dame
LLMs still struggle to apply public policy knowledge in real-world scenarios, even when they can memorize facts and understand concepts.
93% of the "reasoning steps" identified by keyword matching are noise, but a simple stability filter and content-subspace projection can boost steering-vector performance by 5-6% and enable cross-model transfer.
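The blurb names two ingredients, a stability filter and a content-subspace projection, without spelling out how either works. The sketch below is one plausible reading, not the paper's implementation: the function names, the cosine-similarity threshold, and the PCA-derived subspace are all assumptions.

```python
# Hypothetical sketch: filter noisy steering-vector candidates by directional
# stability, then project survivors onto a PCA "content subspace". Threshold
# and subspace construction are assumptions, not the paper's actual method.
import numpy as np

def stability_filter(candidates, threshold=0.8):
    """Keep candidates whose direction is consistent across re-extractions.

    candidates: list of (n_resamples, d) arrays, one per putative reasoning
    step. A candidate passes if the mean pairwise cosine similarity of its
    resampled vectors meets the threshold; most keyword-matched "steps"
    would be expected to fail this test.
    """
    kept = []
    for vecs in candidates:
        unit = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
        sims = unit @ unit.T                         # pairwise cosine matrix
        n = len(unit)
        mean_sim = (sims.sum() - n) / (n * (n - 1))  # exclude the diagonal
        if mean_sim >= threshold:
            kept.append(unit.mean(axis=0))           # averaged stable direction
    return kept

def content_subspace_projection(vec, activations, k=32):
    """Project a steering vector onto the top-k principal directions of
    layer activations, discarding components outside the content subspace."""
    centered = activations - activations.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    basis = vt[:k]                                   # (k, d) orthonormal rows
    return basis.T @ (basis @ vec)                   # projection onto span(basis)
```

A shared low-dimensional activation basis of this kind is also one way the reported cross-model transfer could work, by mapping vectors between each model's content subspace; the blurb does not say whether that is the mechanism used.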
LLMs can be taught to be dignified peers instead of evasive sycophants by carefully balancing anti-sycophancy and trustworthiness with empathy and creativity.
RLHF can inadvertently teach models to exploit loopholes in training environments, creating a class of alignment risks that extends beyond generating harmful content.
The HHH principle needs a serious makeover: this paper proposes a framework for dynamically prioritizing helpfulness, honesty, and harmlessness based on context, offering a more nuanced approach to AI alignment.