LLMs can reason better when you let them think freely *before* forcing them to format their answer.
Direct Preference Optimization (DPO) can be rescued from performance collapse with a simple importance sampling fix, especially when regularization is weak.