Forget expensive human preference data: this new method uses the policy's own value function to self-supervise reward model training, boosting performance across diverse benchmarks and RL algorithms.