RLHF and DPO are surprisingly vulnerable to data poisoning: even a small number of carefully crafted preference pairs can steer the learned policy toward an attacker-chosen (potentially harmful) target.
RLHF models can be made significantly more robust to distribution shift by incorporating distributionally robust optimization into both reward modeling and policy optimization.
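As a loose illustration of the robustness idea above (not the paper's actual method), one common distributionally robust objective is a CVaR-style loss: instead of averaging the Bradley-Terry preference loss over all pairs, average it over only the worst α fraction, so the reward model must fit hard or shifted examples too. The function names and the α value below are assumptions for the sketch.

```python
# Minimal sketch of a CVaR-style distributionally robust loss for
# reward-model training on preference pairs. All names here are
# illustrative, not taken from any specific codebase.
import math

def bt_loss(reward_chosen, reward_rejected):
    # Standard Bradley-Terry preference loss: -log sigmoid(r_c - r_r).
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))

def cvar_dro_loss(pairs, alpha=0.25):
    # pairs: list of (reward_chosen, reward_rejected) scalar scores.
    # Average the per-pair losses over the worst ceil(alpha * n) pairs,
    # i.e. the Conditional Value-at-Risk of the loss distribution.
    losses = sorted((bt_loss(rc, rr) for rc, rr in pairs), reverse=True)
    k = max(1, math.ceil(alpha * len(losses)))
    return sum(losses[:k]) / k

pairs = [(2.0, 0.0), (1.5, 1.0), (0.1, 0.9), (0.0, 2.0)]
average = sum(bt_loss(rc, rr) for rc, rr in pairs) / len(pairs)
robust = cvar_dro_loss(pairs, alpha=0.25)
# By construction, the CVaR loss upper-bounds the average loss.
assert robust >= average
```

Minimizing the CVaR loss pushes the model to perform well under any reweighting of the data concentrated on an α fraction, which is one standard way to hedge against distribution shift.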