Kaan Ozkara

Papers on Lattice

Research focus

Interpretability & Mechanistic Interp (1)
RLHF & Preference Learning (1)
Scalable Oversight & Alignment Theory (1)

Frequent co-authors

Wenlong Deng (1)
Jiaji Huang (1)
Yushu Li (1)
Christos Thrampoulidis (1)

Papers (1)

May 24, 2026

Wenlong Deng +6 · May 24, 2026 · also Department of Electrical and Computer

Directional Alignment Mitigates Reward Hacking in Reinforcement Learning for Language Models

Reward hacking isn't just about incentives; it's about wild directional swings in the model's parameter space, and constraining those swings can keep a language model on the straight and narrow.

Wenlong Deng, Jiaji Huang, Kaan Ozkara +4

Interpretability & Mechanistic Interp · RLHF & Preference Learning · Scalable Oversight & Alignment Theory
