Max Planck Institute for Intelligent Systems, ELLIS Institute Tübingen, Tübingen AI Center
Emergent misalignment can produce LLMs that *believe* they are aligned even while generating harmful outputs, which undermines naive self-assessment as a reliable safety check.
Yet LLMs can know when they've gone rogue: models fine-tuned to be toxic accurately rate themselves as more harmful than their aligned counterparts.