This paper introduces Chain of Uncertain Rewards (CoUR), a framework that leverages LLMs to streamline reward function design in RL by quantifying code uncertainty and reusing relevant reward function components. CoUR employs a similarity selection mechanism that combines textual and semantic analyses to identify and reuse reward components, cutting down on redundant evaluations. Experiments across nine IsaacGym environments and all 20 Bidexterous Manipulation tasks show that CoUR achieves better performance at a significantly lower reward evaluation cost.
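The similarity selection mechanism described above could be sketched as follows. This is a minimal illustration, not the paper's implementation: `textual_similarity` uses character-level sequence matching, and `semantic_similarity` uses token-frequency cosine as a cheap stand-in for the embedding-based semantic score a real system would obtain from an LLM encoder. All function names, weights, and thresholds here are hypothetical.

```python
import difflib
import math
from collections import Counter

def textual_similarity(a: str, b: str) -> float:
    # Character-level similarity between two code snippets, in [0, 1].
    return difflib.SequenceMatcher(None, a, b).ratio()

def semantic_similarity(a: str, b: str) -> float:
    # Stand-in for an embedding-based score: cosine similarity of
    # token-frequency vectors. A real system would compare LLM
    # embeddings of the two snippets instead.
    ca, cb = Counter(a.split()), Counter(b.split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def select_reusable(candidate: str, library: list[str],
                    w_text: float = 0.5, threshold: float = 0.7):
    """Return library components similar enough to the candidate to
    reuse, ranked by a weighted mix of textual and semantic similarity.
    (Weighting scheme and threshold are illustrative assumptions.)"""
    reusable = []
    for comp in library:
        score = (w_text * textual_similarity(candidate, comp)
                 + (1 - w_text) * semantic_similarity(candidate, comp))
        if score >= threshold:
            reusable.append((comp, score))
    return sorted(reusable, key=lambda x: -x[1])
```

Components scoring above the threshold would be reused directly instead of being re-generated and re-evaluated, which is the source of the cost savings the summary describes.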
LLMs can slash the cost of reward function design in RL while simultaneously boosting performance, thanks to a novel framework that reuses and optimizes reward components.
Designing effective reward functions is a cornerstone of reinforcement learning (RL), yet it remains a challenging and labor-intensive process. Existing approaches often rely on extensive manual design and evaluation, which is prone to redundancy and overlooks local uncertainties at intermediate decision points. To address these challenges, we propose Chain of Uncertain Rewards (CoUR), a novel framework that integrates large language models (LLMs) to streamline reward function design and evaluation in RL environments. Specifically, CoUR introduces code uncertainty quantification with a similarity selection mechanism that combines textual and semantic analyses to identify and reuse the most relevant reward function components. By reducing redundant evaluations and applying Bayesian optimization to decoupled reward terms, CoUR enables a more efficient and robust search for optimal reward feedback. We evaluate CoUR comprehensively across nine original environments from IsaacGym and all 20 tasks from the Bidexterous Manipulation benchmark. The experimental results demonstrate that CoUR not only achieves better performance but also significantly lowers the cost of reward evaluations.
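To make the "Bayesian optimization on decoupled reward terms" step concrete, here is a minimal sketch of tuning the weights of two decoupled reward terms with a small Gaussian-process surrogate and an expected-improvement acquisition function. Everything here is an illustrative assumption rather than the paper's method: the RBF kernel, the candidate-grid acquisition, and `train_and_eval`, which stands in for an expensive RL training run and is replaced by a synthetic objective peaked at weights (0.7, 0.3).

```python
import math
import numpy as np

def train_and_eval(weights):
    # Hypothetical stand-in for an expensive RL run that returns task
    # performance under a reward built from these term weights.
    target = np.array([0.7, 0.3])
    return -((weights - target) ** 2).sum()

def rbf(A, B, length):
    # RBF (squared-exponential) kernel between two sets of points.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * length ** 2))

def gp_posterior(X, y, Xq, length=0.4, noise=1e-6):
    # Exact zero-mean GP regression: posterior mean and std at Xq.
    K = rbf(X, X, length) + noise * np.eye(len(X))
    Ks = rbf(X, Xq, length)
    A = Ks.T @ np.linalg.inv(K)
    mu = A @ y
    var = 1.0 - np.einsum('ij,ji->i', A, Ks)
    return mu, np.sqrt(np.clip(var, 1e-12, None))

def expected_improvement(mu, sd, best):
    # EI for maximization: (mu - best) * Phi(z) + sd * phi(z).
    z = (mu - best) / sd
    cdf = np.array([0.5 * (1 + math.erf(v / math.sqrt(2))) for v in z])
    pdf = np.exp(-0.5 * z ** 2) / math.sqrt(2 * math.pi)
    return (mu - best) * cdf + sd * pdf

def bayes_opt(n_init=5, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    # Candidate grid of weight vectors for two decoupled reward terms.
    grid = rng.uniform(0, 1, size=(200, 2))
    X = rng.uniform(0, 1, size=(n_init, 2))
    y = np.array([train_and_eval(w) for w in X])
    for _ in range(n_iter):
        mu, sd = gp_posterior(X, y, grid)
        nxt = grid[np.argmax(expected_improvement(mu, sd, y.max()))]
        X = np.vstack([X, nxt])
        y = np.append(y, train_and_eval(nxt))
    return X[np.argmax(y)], y.max()
```

Because each term's weight is searched separately from the code that generated the term, expensive reward-code regeneration is decoupled from cheap weight tuning, which is the efficiency argument the abstract makes.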