UIUC · Feb 24, 2026 · arXiv:2602.21158

SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards

Dengjia Zhang, Xiaoou Liu, Lu Cheng, Yaqing Wang, Kenton Murray, Hua Wei

AI Summary

The paper introduces SELAUR, a reinforcement learning framework for LLM agents that incorporates the LLM's intrinsic uncertainty into the reward design. SELAUR uses entropy, least-confidence, and margin-based metrics to estimate token-level uncertainty, providing dense, confidence-aligned supervision. Experiments on ALFWorld and WebShop demonstrate that SELAUR improves success rates compared to strong baselines by using uncertainty signals to enhance exploration and robustness.
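
To make the three metrics concrete, here is a minimal sketch of how entropy, least-confidence, and margin scores might be computed from one token's next-token distribution and merged into a single value. The function name `token_uncertainty`, the normalization of entropy by the log of the vocabulary size, and the equal mixing weights are assumptions for illustration; the summary does not state how SELAUR combines the metrics.

```python
import math


def token_uncertainty(probs, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Combine three uncertainty scores for one next-token distribution.

    probs:   probabilities over the vocabulary (sum to 1, at least 2 entries).
    weights: hypothetical mixing weights for (entropy, least-confidence, margin);
             the actual combination used by SELAUR is not given in this summary.
    """
    if len(probs) < 2:
        raise ValueError("need at least two candidate tokens")

    # Shannon entropy, divided by log|V| so the score lies in [0, 1].
    entropy = -sum(p * math.log(p) for p in probs if p > 0) / math.log(len(probs))

    # Probabilities of the two most likely tokens.
    top1, top2 = sorted(probs, reverse=True)[:2]

    # Least confidence: how far the top token is from certainty.
    least_confidence = 1.0 - top1

    # Margin: a small gap between the top two tokens means high uncertainty.
    margin = 1.0 - (top1 - top2)

    w_entropy, w_least, w_margin = weights
    return w_entropy * entropy + w_least * least_confidence + w_margin * margin


if __name__ == "__main__":
    print(token_uncertainty([0.97, 0.01, 0.01, 0.01]))  # confident prediction
    print(token_uncertainty([0.25, 0.25, 0.25, 0.25]))  # maximally uncertain
```

Under these assumptions, a one-hot distribution scores 0 and flatter distributions score higher, which is the property a dense, confidence-aligned signal needs.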

Key Contribution

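LLM agents achieve higher success rates on multi-step decision-making tasks when the reward incorporates the model's own uncertainty, so that even failed trajectories provide a learning signal, rather than rewarding task success alone.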

Abstract

Large language models (LLMs) are increasingly deployed as multi-step decision-making agents, where effective reward design is essential for guiding learning. Although recent work explores various forms of reward shaping and step-level credit assignment, a key signal remains largely overlooked: the intrinsic uncertainty of LLMs. Uncertainty reflects model confidence, reveals where exploration is needed, and offers valuable learning cues even in failed trajectories. We introduce SELAUR: Self Evolving LLM Agent via Uncertainty-aware Rewards, a reinforcement learning framework that incorporates uncertainty directly into the reward design. SELAUR integrates entropy-, least-confidence-, and margin-based metrics into a combined token-level uncertainty estimate, providing dense confidence-aligned supervision, and employs a failure-aware reward reshaping mechanism that injects these uncertainty signals into step- and trajectory-level rewards to improve exploration efficiency and learning stability. Experiments on two benchmarks, ALFWorld and WebShop, show that our method consistently improves success rates over strong baselines. Ablation studies further demonstrate how uncertainty signals enhance exploration and robustness.
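
The failure-aware reshaping idea can be illustrated with a short sketch. This is not the paper's formula, which the abstract does not give. It assumes that each step's uncertainty is the mean of its token-level uncertainties, that successful episodes keep a plain task reward, and that failed episodes receive a small uncertainty-proportional reward at both the step and trajectory level. The names `reshape_rewards`, `success_reward`, and `bonus_scale` are hypothetical.

```python
from statistics import fmean


def reshape_rewards(step_token_uncertainties, success,
                    success_reward=1.0, bonus_scale=0.1):
    """Return (step_rewards, trajectory_reward) for one episode.

    step_token_uncertainties: one non-empty list per step, holding the
        token-level uncertainty values (assumed to lie in [0, 1]).
    success: whether the episode solved the task.
    success_reward, bonus_scale: hypothetical hyperparameters.
    """
    # Aggregate token-level uncertainty into one value per step.
    step_uncertainty = [fmean(tokens) for tokens in step_token_uncertainties]

    if success:
        # Assumption: successful episodes keep the sparse task reward unchanged.
        return [0.0] * len(step_uncertainty), success_reward

    # Assumption: failed episodes still yield a signal, scaled by how
    # uncertain the model was at each step and over the whole trajectory.
    step_rewards = [bonus_scale * u for u in step_uncertainty]
    trajectory_reward = bonus_scale * fmean(step_uncertainty)
    return step_rewards, trajectory_reward


if __name__ == "__main__":
    episode = [[0.2, 0.4], [1.0]]  # two steps, made-up uncertainty values
    print(reshape_rewards(episode, success=False))
    print(reshape_rewards(episode, success=True))
```

Keeping `bonus_scale` well below `success_reward` is one way to ensure that, in this sketch, a failed trajectory never outscores a successful one.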

RLHF & Preference Learning · Tool Use & Agents

Citation Metrics

Citations: 0

Influential citations: 0

References: 60

Year: 2026

Venue: N/A
