University of Michigan, Ann Arbor, MI 48109, USA. {janchen, wxinyuan, shrevzen}@umich.edu
PPO's fixed clipping range hurts exploration by suppressing updates to high-reward, low-probability actions; BandPO addresses this with probability-aware bounds, which improves performance.
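
To make the claim about fixed clipping concrete, here is a minimal sketch of standard PPO ratio clipping only; it does not reproduce BandPO's bounds, which are not specified here. The function names and the choice of epsilon = 0.2 are illustrative assumptions. With a fixed range [1 - epsilon, 1 + epsilon] on the probability ratio, the largest unclipped increase in an action's probability scales with its old probability, so rare actions can gain very little per update.

```python
def max_clipped_prob(old_prob: float, epsilon: float = 0.2) -> float:
    """Largest new probability whose ratio to old_prob stays within 1 + epsilon.

    Hypothetical helper for illustration; epsilon = 0.2 is an assumed value.
    """
    return min(1.0, old_prob * (1.0 + epsilon))


def max_absolute_gain(old_prob: float, epsilon: float = 0.2) -> float:
    """Absolute probability increase available before the clip becomes active."""
    return max_clipped_prob(old_prob, epsilon) - old_prob


if __name__ == "__main__":
    # Compare a rare action with a common one under the same fixed epsilon.
    for p in (0.01, 0.5):
        gain = max_absolute_gain(p)
        print(f"old_prob={p:.2f}  max unclipped gain={gain:.4f}")
```

Under these assumptions the rare action (probability 0.01) can gain about 0.002 before clipping removes the gradient, while the common action (probability 0.5) can gain about 0.1, which is the asymmetry the summary describes.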