School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China
IWPO tackles reward hacking and suboptimal policy distribution in Direct Preference Optimization (DPO) by weighting each training sample according to how closely it adheres to the optimal policy, leading to significant gains in LLM performance.
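
The idea of weighting samples inside a DPO objective can be illustrated with a minimal sketch. It is not the IWPO implementation: the summary does not say how the weights are computed, so they are passed in here as a hypothetical `weights` argument, and the function names, the tuple layout of `pairs`, and the default `beta` are assumptions made for illustration. Only the per-pair term follows the standard DPO loss.

```python
import math


def dpo_pair_loss(logp_chosen, logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    # Margin between the policy/reference log-ratios of chosen and rejected responses.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    z = beta * margin
    # Numerically stable softplus(-z), which equals -log(sigmoid(z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def weighted_dpo_loss(pairs, weights, beta=0.1):
    """Weighted mean of per-pair DPO losses.

    `weights` is a hypothetical per-sample weight (assumption): the source only
    says samples are weighted by adherence to the optimal policy, not how.
    """
    if len(pairs) != len(weights):
        raise ValueError("pairs and weights must have the same length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_sum = sum(
        w * dpo_pair_loss(*pair, beta=beta) for pair, w in zip(pairs, weights)
    )
    return weighted_sum / total_weight


# Each tuple: (logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected).
example_pairs = [(0.0, 0.0, 0.0, 0.0), (-1.0, -5.0, -2.0, -2.0)]
uniform = weighted_dpo_loss(example_pairs, [1.0, 1.0])
reweighted = weighted_dpo_loss(example_pairs, [1.0, 3.0])
```

With uniform weights this reduces to the ordinary mean DPO loss; non-uniform weights shift the objective toward the samples that receive more weight.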