DFKI | Hessian.AI | Oxford | Research Department SAIROL | TU Darmstadt | Zuse School ELIZA
Jun 1, 2026 | arXiv:2606.02194

Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards

Christian Scherer, Joe Watson, Theo Gruner, Daniel Palenicek, Ingmar Posner, Jan Peters

AI Summary

This paper explores inverse reinforcement learning (IRL) as a way to improve the finetuning of large behavior models for robotic control, particularly on sparse-reward tasks. Using coherent imitation learning, the authors learn a dense reward function from expert demonstrations, with the aim of making RL finetuning more sample-efficient. Their method maintains or improves the performance of the pretrained policy on all six manipulation tasks and reaches a success rate of at least 90% on five of them, outperforming RL baselines that rely on sparse rewards.

Key Contribution

Learning a dense reward from expert demonstrations yields a success rate of at least 90% on five of six complex manipulation tasks, outperforming RL baselines that use sparse rewards.

Abstract

Distilling expert demonstration data into large generative models using behavioral cloning (BC) is a scalable approach to learning capable policies for robotic control, particularly for dexterous manipulation. Reinforcement learning (RL) can be used to finetune these policies further using additional experience. An open question is whether RL is more sample-efficient than collecting more human demonstrations. Prior work has finetuned large pretrained policies in a scalable fashion by applying RL to a smaller residual policy that corrects the pretrained model. However, for typical sparse-reward tasks, RL algorithms can struggle to optimize the behavior in a sample-efficient manner. We explore inverse reinforcement learning (IRL), where a dense reward function is learned from expert demonstrations, potentially reducing the challenge of RL finetuning. We specifically consider coherent imitation learning, an IRL method that facilitates improvement of the BC policy by using a specific reward formulation with theoretical guarantees. We show that our IRL method maintains or improves the performance of $\pi_{0.5}$ on all six sparse-reward manipulation tasks and achieves a $\geq 90\%$ success rate on five out of six complex manipulation tasks, outperforming RL-based baselines using sparse rewards. By ensuring that the pretrained policy that initializes finetuning is optimal for our initial reward and critic, our method circumvents the initial drop commonly seen in RL finetuning and enables faster improvement.
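The residual-policy idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the additive combination rule, the `scale` factor, and the action bounds are all assumptions made for illustration, and the toy policies stand in for a large pretrained model and a small learned corrector.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def residual_action(
    base_policy: Callable[[Vector], List[float]],
    residual_policy: Callable[[Vector, Vector], List[float]],
    observation: Vector,
    scale: float = 0.1,   # hypothetical weight on the correction
    low: float = -1.0,    # hypothetical action bounds
    high: float = 1.0,
) -> List[float]:
    """Combine a frozen base action with a small learned correction."""
    base = base_policy(observation)
    # The residual policy sees the observation and the base action.
    delta = residual_policy(observation, base)
    # Add the scaled correction, then clip to the valid action range.
    return [min(high, max(low, b + scale * d)) for b, d in zip(base, delta)]


def frozen_base(observation: Vector) -> List[float]:
    """Toy stand-in for a pretrained behavior model."""
    return [0.5 * x for x in observation]


def zero_residual(observation: Vector, base: Vector) -> List[float]:
    """Toy residual that proposes no correction."""
    return [0.0 for _ in base]


if __name__ == "__main__":
    print(residual_action(frozen_base, zero_residual, [0.2, -0.4]))
```

In this sketch, only the residual policy would be trained with RL while the base policy stays fixed, and a residual that outputs zeros leaves the base action unchanged.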

RLHF & Preference Learning | Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
