University of Science and Technology of China
Forget expensive human preference data: this new method uses the policy's own value function to self-supervise reward model training, boosting performance across diverse benchmarks and RL algorithms.