University of Southern California
Policy gradient methods may be self-defeating in language model reasoning, as their inherent entropy reduction chokes off exploration and limits downstream performance.