Policy gradient methods may be self-defeating in language model reasoning, as their inherent entropy reduction chokes off exploration and limits downstream performance.
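The entropy-reduction effect can be illustrated with a minimal sketch on a toy problem. The setup here is an assumption made purely for illustration and is not taken from the source: a four-armed bandit with a softmax policy, hypothetical reward values, and exact (expected) policy gradient updates in place of sampled ones. It shows only that policy entropy falls as the policy concentrates on the highest-reward action. It says nothing about language models or downstream performance.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def run_policy_gradient(rewards, steps=200, lr=1.0):
    """Exact policy gradient ascent on a softmax bandit policy.

    For a softmax policy, the gradient of the expected reward J with
    respect to logit i is pi_i * (r_i - J). The step count and learning
    rate are illustrative choices, not values from the source.
    """
    logits = [0.0] * len(rewards)  # start from the uniform policy
    entropies = []
    for _ in range(steps):
        probs = softmax(logits)
        entropies.append(entropy(probs))
        expected = sum(p * r for p, r in zip(probs, rewards))
        logits = [
            t + lr * p * (r - expected)
            for t, p, r in zip(logits, probs, rewards)
        ]
    return softmax(logits), entropies


# Hypothetical rewards for four actions; arm 0 is the best.
rewards = [1.0, 0.5, 0.2, 0.0]
final_probs, entropies = run_policy_gradient(rewards)

print(f"initial entropy: {entropies[0]:.4f} nats")
print(f"final entropy:   {entropies[-1]:.4f} nats")
print("final policy:   ", [round(p, 4) for p in final_probs])
```

In this toy setting the policy starts at maximum entropy and moves its probability mass toward the best arm, so lower-reward actions are visited less and less. Whether and how strongly this carries over to language model reasoning is the claim under discussion, not something the sketch establishes.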