Search papers, labs, and topics across Lattice.
Department of Foundation Model, Tongji University
1
0
2
Ditch hard clipping: ratio-variance regularization offers a principled, "soft brake" approach to trust region policy optimization, unlocking substantial gains in sample efficiency and performance, especially for smaller LLMs.