TOPReward: Token Probabilities as Hidden Zero-Shot Rewards for Robotics

Shirui Chen, Cole Harrison, Ying-Chun Lee, Angela Jin Yang, Zhongzheng Ren, Lillian J. Ratliff, Jiafei Duan, Dieter Fox, Ranjay Krishna

AI Summary

The paper introduces TOPReward, a temporal value function for robotics that uses pretrained video Vision-Language Models (VLMs) to estimate task progress by reading the VLM's internal token logits. It targets two weaknesses of prior work: existing temporal value functions often fail to generalize beyond their training domains, and methods that prompt a VLM to output progress values directly are prone to numerical misrepresentation. In zero-shot evaluation across 130+ real-world tasks, TOPReward achieves a 0.947 mean Value-Order Correlation (VOC) on Qwen3-VL, whereas the state-of-the-art GVL baseline achieves near-zero correlation on the same open-source model.

Key Contribution

TOPReward extracts accurate task progress signals directly from a pretrained VLM's token probabilities, bypassing the need for explicit reward engineering and for prompting the model to state numeric progress values.
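As a rough illustration of what reading a progress signal from token probabilities could look like, the sketch below converts raw next-token logits into the probability of a single chosen token, evaluated once per video prefix. This page does not give the prompt, the token that is read, or any normalization the method applies, so the affirmative "yes" token, the toy two-token vocabulary, and the names `token_probability`, `progress_from_prefixes`, and `yes_token_id` are all assumptions made for illustration, not the authors' formulation.

```python
import math


def token_probability(logits, token_id):
    """Softmax probability of one token, given raw logits over the vocabulary."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    return exps[token_id] / sum(exps)


def progress_from_prefixes(logits_per_prefix, yes_token_id):
    """One score per video prefix.

    Assumption: each entry holds the logits a VLM produced when asked a
    completion-style question about a longer and longer prefix of the
    episode, and `yes_token_id` indexes a hypothetical affirmative token.
    """
    return [token_probability(logits, yes_token_id) for logits in logits_per_prefix]


# Toy logits over a two-token vocabulary: index 0 = "no", index 1 = "yes".
toy_logits = [[2.0, 0.0], [0.0, 0.0], [0.0, 2.0]]
scores = progress_from_prefixes(toy_logits, yes_token_id=1)
print(scores)  # scores rise as the "yes" logit overtakes the "no" logit
```

Reading a probability rather than parsing generated digits keeps the signal continuous, which is the contrast with direct numeric prompting that the text draws.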

Abstract

While Vision-Language-Action (VLA) models have seen rapid progress in pretraining, their advancement in Reinforcement Learning (RL) remains hampered by low sample efficiency and sparse rewards in real-world settings. Developing generalizable process reward models is essential for providing the fine-grained feedback necessary to bridge this gap, yet existing temporal value functions often fail to generalize beyond their training domains. We introduce TOPReward, a novel, probabilistically grounded temporal value function that leverages the latent world knowledge of pretrained video Vision-Language Models (VLMs) to estimate robotic task progress. Unlike prior methods that prompt VLMs to directly output progress values, which are prone to numerical misrepresentation, TOPReward extracts task progress directly from the VLM's internal token logits. In zero-shot evaluations across 130+ distinct real-world tasks and multiple robot platforms (e.g., Franka, YAM, SO-100/101), TOPReward achieves 0.947 mean Value-Order Correlation (VOC) on Qwen3-VL, dramatically outperforming the state-of-the-art GVL baseline, which achieves near-zero correlation on the same open-source model. We further demonstrate that TOPReward serves as a versatile tool for downstream applications, including success detection and reward-aligned behavior cloning.

Multimodal Models, RLHF & Preference Learning, Robotics & Embodied AI

Citation Metrics

Citations: 0

Influential citations: 0

References: 55

Year: 2026

Venue: N/A
