Flow-DPPO outperforms traditional PPO methods by achieving higher rewards and greater training stability through a novel divergence proximal constraint.
Smooth gradient adjustments in DRPO prevent harmful policy shifts, leading to more stable and efficient LLM training.