Martin Wattenberg

Research focus

Constitutional AI & AI Ethics (2), Inference & Quantization (1), Red-Teaming & Adversarial Robustness (1), Eval Frameworks & Benchmarks (1), Reasoning & Chain-of-Thought (1)

Frequent co-authors

Hadas Orgad (1), Boyi Wei (1), Kaden Zheng (1), Seraphina Goldfarb-Tarrant (1)

Papers (2)

Apr 10, 2026 · also Cohere, Princeton

Large Language Models Generate Harmful Content Using a Distinct, Unified Mechanism

LLMs' harmful outputs stem from a surprisingly compact and unified set of weights, suggesting a fundamental, addressable structure underlying even emergent misalignment.

Hadas Orgad, Boyi Wei, Kaden Zheng +2

Constitutional AI & AI Ethics, Inference & Quantization, Red-Teaming & Adversarial Robustness

Google Research · Mar 10, 2026 · also CMU ML, DeepMind, Harvard

Think Before You Lie: How Reasoning Improves Honesty

LLMs get *more* honest when they have time to reason, defying human tendencies and revealing surprising insights about their internal representational geometry.

Ann Yuan, Asma Ghandeharioun, Carter Blum +4

Constitutional AI & AI Ethics, Eval Frameworks & Benchmarks, Reasoning & Chain-of-Thought
