Southwest U | May 6, 2026 | arXiv:2605.04960

EP-GRPO: Entropy-Progress Aligned Group Relative Policy Optimization with Implicit Process Guidance

Song Yu, Li Li, Wenwen Zhao, Zhisheng Yang

AI Summary

The paper identifies and quantifies three credit assignment failures in Group Relative Policy Optimization (GRPO) for LLM reasoning: uniform token granularity, uniform polarity, and zero-variance collapse. To address them, the authors introduce Entropy-Progress Aligned GRPO (EP-GRPO), which combines entropy-gated modulation, implicit process signals derived from policy divergence, and cumulative entropy mapping to provide dense, self-supervised guidance. Experiments on mathematical reasoning benchmarks show that EP-GRPO outperforms GRPO and its variants in both accuracy and efficiency.
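The zero-variance collapse mentioned above can be illustrated with the standard GRPO group-relative advantage, in which each sampled response's reward is normalized by the mean and standard deviation of its group. The following is a minimal sketch, not the paper's code: the function name, the `eps` value, and the use of the population standard deviation are assumptions made for illustration.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its sampling group (GRPO-style).

    Assumption: population standard deviation and a small eps for
    numerical stability; real implementations may differ in both.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# A group with mixed outcomes yields non-zero advantages of both signs.
mixed = group_relative_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every response gets the same reward has zero variance:
# every advantage is exactly zero, so the outcome signal contributes
# no gradient for this prompt.
collapsed = group_relative_advantages([1.0, 1.0, 1.0, 1.0])

print(mixed)
print(collapsed)
```

When every response in a group is correct (or every response is wrong), the whole group is effectively wasted for training, which is the "training waste" the summary refers to.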

Key Contribution

GRPO's credit assignment failures, such as treating all tokens as equally important and misaligning step-level rewards, can be overcome with a self-supervised approach that mines the model's intrinsic information flow.

Abstract

Reinforcement learning with verifiable rewards (RLVR), particularly Group Relative Policy Optimization (GRPO), has advanced LLM reasoning. However, GRPO suffers from three credit assignment failures: uniform token-level granularity that ignores heterogeneous informational value, uniform polarity that penalizes correct steps and rewards incorrect ones, and zero-variance collapse that erases outcome-driven gradients. We systematically quantify these failures, revealing highly non-uniform token informativeness, widespread step-level polarity misalignment, and substantial training waste. To address these limitations, we propose Entropy-Progress Aligned GRPO (EP-GRPO), a framework that mines the model's intrinsic information flow for dense, self-supervised guidance. EP-GRPO integrates entropy-gated modulation to prioritize high-entropy decision pivots, implicit process signals from policy divergence anchored to outcome advantages for directional token-level feedback without external reward models, and cumulative entropy mapping that enables progress-aligned advantage normalization, naturally maintaining gradient flow under zero reward variance. Extensive experiments on mathematical reasoning benchmarks demonstrate that EP-GRPO achieves superior accuracy and efficiency compared to GRPO and its variants. The code will be available.
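To make the idea of "high-entropy decision pivots" concrete, the sketch below computes the Shannon entropy of a next-token distribution and turns per-token entropies into relative weights. The abstract does not give EP-GRPO's actual gating formula, so `entropy_gate` (entropy divided by the sequence mean) is a hypothetical stand-in chosen only to show how high-entropy tokens could receive more weight than near-deterministic ones; the logits are made-up values.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits):
    """Shannon entropy (in nats) of the next-token distribution."""
    return -sum(p * math.log(p) for p in softmax(logits) if p > 0.0)


def entropy_gate(entropies, eps=1e-8):
    """Hypothetical gate: each token's entropy relative to the mean.

    Assumption for illustration only; not the formula used by EP-GRPO.
    """
    mean_h = sum(entropies) / len(entropies)
    return [h / (mean_h + eps) for h in entropies]


# Made-up logits for two positions over a 4-token vocabulary.
uncertain = [0.0, 0.0, 0.0, 0.0]   # uniform: a "decision pivot"
confident = [10.0, 0.0, 0.0, 0.0]  # sharply peaked: nearly deterministic

entropies = [token_entropy(uncertain), token_entropy(confident)]
weights = entropy_gate(entropies)

print(entropies)
print(weights)
```

Under this toy gate the uniform position dominates the weighting while the peaked position is almost ignored, which matches the abstract's claim that token informativeness is highly non-uniform.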

Reasoning & Chain-of-Thought | RLHF & Preference Learning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
