The paper identifies and quantifies three key credit assignment failures in Group Relative Policy Optimization (GRPO) for LLM reasoning: uniform token granularity, uniform polarity, and zero-variance collapse. To address these, the authors introduce Entropy-Progress Aligned GRPO (EP-GRPO), which uses entropy-gated modulation, implicit process signals from policy divergence, and cumulative entropy mapping to provide dense, self-supervised guidance. Experiments on mathematical reasoning benchmarks show that EP-GRPO outperforms GRPO and its variants in accuracy and efficiency.
GRPO's credit assignment failures (treating all tokens as equally important and misaligning step-level rewards) can be overcome with a self-supervised approach that mines the model's intrinsic information flow.
Reinforcement learning with verifiable rewards (RLVR), particularly Group Relative Policy Optimization (GRPO), has advanced LLM reasoning. However, GRPO suffers from three credit assignment failures: uniform token-level granularity that ignores heterogeneous informational value, uniform polarity that penalizes correct steps and rewards incorrect ones, and zero-variance collapse that erases outcome-driven gradients. We systematically quantify these failures, revealing highly non-uniform token informativeness, widespread step-level polarity misalignment, and substantial training waste. To address these limitations, we propose Entropy-Progress Aligned GRPO (EP-GRPO), a framework that mines the model's intrinsic information flow for dense, self-supervised guidance. EP-GRPO integrates entropy-gated modulation to prioritize high-entropy decision pivots, implicit process signals from policy divergence anchored to outcome advantages for directional token-level feedback without external reward models, and cumulative entropy mapping that enables progress-aligned advantage normalization, naturally maintaining gradient flow under zero reward variance. Extensive experiments on mathematical reasoning benchmarks demonstrate that EP-GRPO achieves superior accuracy and efficiency compared to GRPO and its variants. The code will be available.
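To make the zero-variance collapse concrete, the following is a minimal sketch of the standard group-relative advantage used in baseline GRPO, not of EP-GRPO itself. The function name `grpo_advantages`, the `eps` constant, and the use of the population standard deviation are assumptions made for illustration; actual implementations may differ in these details.

```python
import math

def grpo_advantages(rewards, eps=1e-6):
    """Standardize each outcome reward within its sampled group.

    Hypothetical helper for illustration: baseline GRPO assigns this
    single scalar to every token of the corresponding response.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    # Population variance of the group's outcome rewards (an assumption).
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

# A group with mixed outcomes yields non-zero advantages.
mixed = grpo_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response gets the same reward yields all zeros,
# so the prompt contributes no outcome-driven gradient.
all_correct = grpo_advantages([1.0, 1.0, 1.0, 1.0])
```

In the second group every advantage is exactly zero, which is the situation the abstract describes as erasing outcome-driven gradients. Because the same scalar is shared by all tokens of a response, the sketch also shows the uniform token-level granularity that the paper criticizes.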