GRPO's struggle with exploration and difficulty adaptation in LLM reasoning stems from a previously unnoticed symmetry in its advantage estimation, which can be overcome by asymmetrically weighting correct vs. incorrect trajectories.
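
To make the symmetry concrete, here is a minimal sketch and not the paper's actual method. It assumes binary rewards (1 for a correct trajectory, 0 for an incorrect one) and the standard group-normalized GRPO advantage, (reward - group mean) / group standard deviation. Under that normalization the advantages in a group always sum to zero, so the total positive signal on correct trajectories exactly mirrors the total negative signal on incorrect ones. The function `asymmetric_advantages` and its parameters `w_correct` and `w_incorrect` are hypothetical names introduced only for illustration; the summary above does not specify the weighting scheme.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages: (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the sampled group
    return [(r - mu) / (sigma + eps) for r in rewards]


def asymmetric_advantages(rewards, w_correct=1.0, w_incorrect=0.5, eps=1e-8):
    """Hypothetical asymmetric variant (an assumption, not the paper's rule):
    scale the advantage of correct and incorrect trajectories by different weights."""
    base = grpo_advantages(rewards, eps)
    return [
        a * (w_correct if r > 0 else w_incorrect)
        for a, r in zip(base, rewards)
    ]


# A hard prompt: only one of four sampled trajectories is correct.
rewards = [1, 0, 0, 0]
base = grpo_advantages(rewards)
skewed = asymmetric_advantages(rewards)

# Symmetric case: positive and negative advantage mass cancel exactly.
print(sum(base))
# Asymmetric case: the balance tilts toward the correct trajectory.
print(sum(skewed))
```

With `w_incorrect` below `w_correct`, the summed advantage becomes positive, so the rare correct trajectory on a hard prompt contributes more than the incorrect ones subtract. Which direction and magnitude of asymmetry actually helps is a claim of the paper and is not established by this sketch.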