Forget expensive reward models: this work shows how a compact MathVerifier can guide Direct Preference Optimization (DPO), mining hard negatives and weighting preference pairs to significantly improve mathematical reasoning in small language models.