Off-policy reinforcement learning can boost LLM reasoning by 12.5% and solve 40% more problems than on-policy methods, simply by re-evaluating and reusing historically difficult samples.