This paper introduces Rewarding DINO, a language-conditioned reward modeling approach that learns dense reward functions from visual data for robot manipulation tasks. The model is trained using a rank-based loss on data from 24 Meta-World+ tasks, enabling it to predict rewards based on task semantics rather than specific trajectories. Results demonstrate competitive performance on training tasks and generalization to novel simulated and real-world settings, showcasing its ability to learn meaningful reward functions.
Forget hand-engineered reward functions: Rewarding DINO learns dense, generalizable rewards for robot manipulation directly from visual data, opening the door to more autonomous skill acquisition.
Well-designed dense reward functions in robot manipulation not only indicate whether a task is completed but also encode progress along the way. Designing dense rewards is challenging, however, and usually requires privileged state information that is available only in simulation, not in real-world experiments. This makes reward prediction models that infer task state from camera images attractive. A common approach is to predict rewards from expert demonstrations based on visual similarity or sequential frame ordering. However, this biases the resulting reward function towards a specific solution and leaves it undefined in states not covered by the demonstrations. In this work, we introduce Rewarding DINO, a method for language-conditioned reward modeling that learns actual reward functions rather than specific trajectories. The model's compact size allows it to serve as a direct replacement for analytical reward functions with comparatively low computational overhead. We train our model on data sampled from 24 Meta-World+ tasks using a rank-based loss and evaluate pairwise accuracy, rank correlation, and calibration. Rewarding DINO achieves competitive performance on tasks from the training set and generalizes to new settings in simulation and the real world, indicating that it learns task semantics. We also pair the model with off-the-shelf reinforcement learning algorithms to solve tasks from our Meta-World+ training set.
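To illustrate the general idea of rank-based reward learning, the following is a minimal sketch of a pairwise logistic ranking loss, not the paper's exact objective: given predicted rewards for frames of a trajectory and their ground-truth task progress, every frame pair that is ordered correctly by progress should also be ordered correctly by predicted reward. The function name and the use of scalar progress labels are illustrative assumptions.

```python
import numpy as np

def pairwise_rank_loss(pred_rewards, progress):
    """Illustrative pairwise logistic ranking loss (not the paper's
    exact objective). For every frame pair (i, j) where true
    progress[i] > progress[j], penalize the model unless the
    predicted reward margin pred_rewards[i] - pred_rewards[j]
    is large and positive."""
    n = len(pred_rewards)
    losses = []
    for i in range(n):
        for j in range(n):
            if progress[i] > progress[j]:
                margin = pred_rewards[i] - pred_rewards[j]
                # log(1 + exp(-margin)): near 0 for correctly
                # ordered pairs, grows for inverted pairs.
                losses.append(np.log1p(np.exp(-margin)))
    return float(np.mean(losses))
```

A loss of this form only constrains the *ordering* of rewards along task progress, which is what lets the learned function encode semantics of progress rather than imitate one particular demonstrated trajectory.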