… $1\times 10^{-4}$. We adopt a learning rate scheduling strategy that combines a linear warm-up (initial 5 epochs) with a cosine annealing decay. The base learning rate is set to …
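The schedule in this excerpt (linear warm-up over the first 5 epochs, then cosine annealing) could be sketched per epoch roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `lr_at_epoch` and the parameters `base_lr`, `total_epochs`, and `min_lr` are hypothetical. The excerpt cuts off before stating the base learning rate, so the usage line uses a placeholder value of 1.0.

```python
import math


def lr_at_epoch(epoch, base_lr, total_epochs, warmup_epochs=5, min_lr=0.0):
    """Learning rate for a 0-indexed epoch: linear warm-up, then cosine annealing.

    All parameter names and defaults other than the 5 warm-up epochs are
    assumptions made for illustration.
    """
    if epoch < warmup_epochs:
        # Linear ramp that reaches base_lr on the last warm-up epoch.
        return base_lr * (epoch + 1) / warmup_epochs
    # Fraction of the decay phase completed, from 0 up to 1.
    progress = (epoch - warmup_epochs) / max(1, total_epochs - warmup_epochs)
    # Cosine annealing from base_lr down toward min_lr.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Placeholder values: base_lr=1.0 and total_epochs=100 are not from the source.
schedule = [lr_at_epoch(e, base_lr=1.0, total_epochs=100) for e in range(100)]
```

Multiplying each entry of `schedule` by the actual base learning rate gives the per-epoch value. A per-step variant would follow the same shape with steps in place of epochs.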
By prioritizing diversity over accuracy in experience replay, DyJR significantly boosts LLM reasoning performance in RL, outperforming GRPO and other baselines without sacrificing training efficiency.