Beihang · BIT · Apr 13, 2026 · arXiv:2604.11611

Utilizing and Calibrating Hindsight Process Rewards via Reinforcement with Mutual Information Self-Evaluation

Jiashu Yao, Heyan Huang, Zeming Liu, Yuhang Guo

AI Summary

This paper introduces Mutual Information Self-Evaluation (MISE), an RL paradigm for LLM agents that uses hindsight generative self-evaluation as a dense reward signal and calibrates it against environmental feedback. The theoretical analysis shows that using hindsight self-evaluation rewards is equivalent to minimizing an objective that combines mutual information with a KL divergence term between the policy and a proxy reward policy, which justifies the reward calibration step. Experiments show that MISE enables open-source LLMs of about 7B parameters to achieve performance comparable to GPT-4o on validation, without expert supervision, by supplementing sparse extrinsic rewards with dense internal rewards.

Key Contribution

Open-source 7B LLMs can now rival GPT-4o performance on validation tasks, thanks to a novel reinforcement learning approach that leverages calibrated self-evaluation as a dense reward signal.

Abstract

To overcome the sparse reward challenge in reinforcement learning (RL) for agents based on large language models (LLMs), we propose Mutual Information Self-Evaluation (MISE), an RL paradigm that utilizes hindsight generative self-evaluation as dense reward signals while simultaneously calibrating them against environmental feedback. Empirically, MISE enables an agent to learn autonomously from dense internal rewards that supplement sparse extrinsic signals. Theoretically, our work provides the first formal foundation for the paradigm of generative self-rewarding. We prove that utilizing hindsight self-evaluation rewards is equivalent to minimizing an objective that combines mutual information with a KL divergence term between the policy and a proxy reward policy. This theoretical insight then informs and justifies our calibration step, which actively aligns these rewards with the optimal policy. Extensive experiments show that MISE outperforms strong baselines, enabling open-source LLMs of about 7B parameters to achieve performance comparable to GPT-4o on validation without expert supervision.
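
The abstract does not give the reward formula, so the following is only a minimal sketch of the general idea of supplementing a sparse extrinsic reward with dense per-step self-evaluation scores. It is not the paper's MISE algorithm: the function names, the additive mixing rule, the weight `beta`, and the discount `gamma` are all assumptions made for illustration, and the calibration step is not modeled.

```python
from typing import List


def combine_rewards(self_eval: List[float],
                    extrinsic: List[float],
                    beta: float = 0.5) -> List[float]:
    """Add a weighted dense self-evaluation score to each step's extrinsic reward.

    Hypothetical mixing rule (an assumption, not taken from the paper):
        r_t = extrinsic_t + beta * self_eval_t
    """
    if len(self_eval) != len(extrinsic):
        raise ValueError("self_eval and extrinsic must have one entry per step")
    return [e + beta * s for s, e in zip(self_eval, extrinsic)]


def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Compute the discounted return G_t = r_t + gamma * G_{t+1} for every step."""
    returns: List[float] = []
    running = 0.0
    # Walk the trajectory backwards so each step can reuse the next step's return.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


if __name__ == "__main__":
    # Sparse environment signal: only the final step is rewarded.
    extrinsic = [0.0, 0.0, 1.0]
    # Made-up hindsight self-evaluation scores, one per step.
    self_eval = [0.2, 0.4, 0.6]

    mixed = combine_rewards(self_eval, extrinsic, beta=0.5)
    print("sparse only:", discounted_returns(extrinsic, gamma=0.5))
    print("with dense :", discounted_returns(mixed, gamma=0.5))
```

With the sparse signal alone, early steps receive credit only through discounting from the final reward; in this sketch, the dense term gives every step its own immediate signal as well.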

RLHF & Preference Learning · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
