Supervised fine-tuning (SFT) can match the generalization performance of offline RL methods such as DPO when the training data is aligned with the model's own distribution, according to a new theoretical analysis.