Kosuke Nishida

Research focus

Constitutional AI & AI Ethics (1), Interpretability & Mechanistic Interp (1), RLHF & Preference Learning (1)

Frequent co-authors

Kazutoshi Shinoda (1), Kyosuke Nishida (1)

Papers (1)

NTT Human Informatics Laboratories, Apr 30, 2026

Debiasing Reward Models via Causally Motivated Inference-Time Intervention

Forget scaling laws: surgically debiasing reward models by intervening on just 2% of neurons lets smaller models punch *way* above their weight in alignment.

Kazutoshi Shinoda, Kosuke Nishida, Kyosuke Nishida

Constitutional AI & AI Ethics, Interpretability & Mechanistic Interp, RLHF & Preference Learning
