Southern University of Science and Technology
PPO can be made sample-efficient and stable for long-horizon reasoning in LLMs by treating the problem as a sequence-level contextual bandit, sidestepping the need to sample multiple responses per prompt.
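A minimal sketch of what the sequence-level contextual-bandit view implies for the PPO objective: the per-token log-probs of a single sampled response are summed into one sequence-level importance ratio, and the advantage is computed from the sequence reward against a baseline, so no group of samples per prompt is needed. The function name, the scalar `baseline` argument, and the clipping constant are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def sequence_ppo_loss(new_logps, old_logps, reward, baseline, clip_eps=0.2):
    """PPO clipped surrogate at the sequence level (contextual bandit view).

    new_logps / old_logps: per-token log-probs of ONE sampled response under
    the current and behavior policies. Summing over tokens yields a single
    sequence-level log-ratio, so only one sample per prompt is required.
    `baseline` stands in for a learned value estimate of the prompt
    (an assumption for this sketch).
    """
    log_ratio = np.sum(new_logps) - np.sum(old_logps)
    ratio = np.exp(log_ratio)
    advantage = reward - baseline  # single-sample advantage vs. the baseline
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantage
    # PPO maximizes the pessimistic (min) surrogate; return its negation as a loss
    return -min(unclipped, clipped)
```

When the new and old policies agree, the ratio is 1 and the loss reduces to the negative single-sample advantage; when the ratio drifts outside the clip range, the gradient is cut off, which is what stabilizes updates without group sampling.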