This paper introduces a hierarchical planning approach that leverages MBRL world models at inference time to actively seek informative states, improving sample efficiency. The method dynamically adjusts replanning frequency, planning horizon, and commitment to entropy-seeking based on the world model's latent predictions. Applied to Dreamer, the approach demonstrates significant sample efficiency gains across MiniWorld, Crafter, and DeepMind Control tasks compared to the base Dreamer agent.
MBRL agents can dramatically improve sample efficiency by using their world models not just for training, but also for actively planning to explore informative, high-entropy states at inference time.
Model-based reinforcement learning (MBRL) offers an intuitive way to improve the sample efficiency of model-free RL methods by simultaneously training a world model that learns to predict the future. The world model accounts for the large majority of training compute and time, and is subsequently used to train actors entirely in simulation, yet once this is done it is quickly discarded. We show in this work that utilising these models at inference time can significantly boost sample efficiency. We propose a novel approach that anticipates and actively seeks out informative states using the world model's short-horizon latent predictions, offering a principled alternative to traditional curiosity-driven methods that chase outdated estimates of high-uncertainty states. While many model predictive control (MPC) based methods offer similar alternatives, they typically lack commitment, synthesising multiple multi-step plans at every step. To mitigate this, we present a hierarchical planner that dynamically decides when to replan, how far ahead to plan, and how strongly to commit to entropy-seeking. While our method can in principle be applied to any model that trains its actors solely on model-generated data, we apply it to Dreamer to illustrate the concept. Our method finishes MiniWorld's procedurally generated mazes 50% faster than base Dreamer at convergence, and in only 60% of the environment steps that base Dreamer's policy needs; it displays reasoned exploratory behaviour in Crafter, achieving the same reward as base Dreamer in a third of the steps; and planning tends to improve sample efficiency on DeepMind Control tasks.
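The abstract does not give the planner's internals, but the decisions it describes — when to replan, how far ahead to plan, and how strongly to seek entropy — can be sketched as a single inference-time loop. The sketch below is a hypothetical minimal illustration, not the paper's actual algorithm: `world_model`, the entropy bonus, and the replan threshold are all assumptions introduced here, and the toy model merely mimics latent predictions that grow more uncertain with horizon.

```python
import math

def predictive_entropy(probs):
    """Shannon entropy (nats) of a categorical latent prediction."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def hierarchical_plan(world_model, state, max_horizon=15,
                      entropy_weight=0.5, replan_threshold=1.0):
    """Roll the world model forward, scoring each candidate horizon by
    predicted reward plus an entropy-seeking bonus, and stop committing
    once the model's own short-horizon predictions become too uncertain.
    Returns the chosen commitment horizon and its score. (Hypothetical
    sketch; the paper's planner is not specified at this level.)"""
    best_h, best_score = 1, float("-inf")
    z, plan_score = state, 0.0
    for h in range(1, max_horizon + 1):
        # world_model is assumed to return (next_latent, reward, latent_probs).
        z, reward, latent_probs = world_model(z)
        plan_score += reward + entropy_weight * predictive_entropy(latent_probs)
        if plan_score > best_score:
            best_score, best_h = plan_score, h
        # Replan trigger: latent predictions are no longer trustworthy.
        if predictive_entropy(latent_probs) > replan_threshold:
            break
    return best_h, best_score

def make_toy_model():
    """Toy stand-in world model whose latent predictions grow more
    uncertain with each imagined step."""
    t = {"n": 0}
    def model(z):
        t["n"] += 1
        n = min(t["n"], 5)
        return z, 1.0, [1.0 - 0.1 * n, 0.1 * n]
    return model
```

In this toy setup the loop commits for as long as the imagined latents stay confident, then hands control back for replanning — the "commitment" the abstract contrasts with MPC-style methods that re-synthesise plans at every step.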