LLMs can learn to avoid repeating mistakes by remembering and penalizing frequently recurring error patterns in past rollouts.
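A minimal sketch of this idea: keep a running count of error patterns seen in past rollouts and subtract a penalty proportional to how often a pattern has recurred. The function name, penalty weight `alpha`, and linear weighting are illustrative assumptions, not the paper's actual method.

```python
from collections import Counter

def penalized_reward(base_reward, error_pattern, history, alpha=0.5):
    # Illustrative sketch (names and weighting are assumptions):
    # penalize a rollout's reward in proportion to how many times its
    # error pattern has already been seen, so frequently recurring
    # mistakes are discouraged more strongly over time.
    if error_pattern is None:
        return base_reward
    history[error_pattern] += 1
    # First occurrence carries no penalty; each repeat adds alpha.
    return base_reward - alpha * (history[error_pattern] - 1)

history = Counter()
print(penalized_reward(1.0, "off_by_one", history))  # first occurrence: 1.0
print(penalized_reward(1.0, "off_by_one", history))  # repeated pattern: 0.5
```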
PPO's fixed clipping range hurts exploration by squashing updates for high-reward, low-probability actions; BandPO replaces it with probability-aware bounds and improves performance.
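The standard PPO objective clips the importance ratio to a fixed band [1−ε, 1+ε], so a rare action whose probability doubles still gets its objective capped. The sketch below contrasts that with a probability-aware upper bound that widens for low-probability actions; the specific widening rule here is an illustrative assumption, not BandPO's actual formula.

```python
import numpy as np

def ppo_clipped_objective(ratio, advantage, eps=0.2):
    # Standard PPO: the ratio pi_new/pi_old is clipped to the fixed
    # band [1 - eps, 1 + eps] regardless of how rare the action was.
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1 - eps, 1 + eps) * advantage)

def band_clipped_objective(ratio, advantage, old_prob, delta=0.2):
    # Hypothetical probability-aware band (illustrative only, not
    # BandPO's published rule): widen the upper bound for rarer
    # actions so high-advantage, low-probability actions are not
    # squashed as hard by the clip.
    upper = 1 + delta / np.sqrt(max(old_prob, 1e-8))
    lower = 1 - delta
    return np.minimum(ratio * advantage,
                      np.clip(ratio, lower, upper) * advantage)

# A rare action (old_prob = 0.01) whose probability doubled, with
# positive advantage: fixed clipping caps the objective at 1.2, while
# the probability-aware band lets the full ratio of 2.0 through.
print(ppo_clipped_objective(2.0, 1.0))         # 1.2
print(band_clipped_objective(2.0, 1.0, 0.01))  # 2.0
```

The key contrast is that the fixed band treats a doubling of a 1% action the same as a doubling of a 50% action, while the probability-aware band preserves more of the learning signal for the rare case.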