Feb 18, 2026 · arXiv:2602.16363

Improved Bounds for Reward-Agnostic and Reward-Free Exploration

AI Summary

This paper presents a new algorithm for reward-agnostic exploration in episodic finite-horizon MDPs that relaxes the restrictive requirement on the accuracy parameter ε imposed by prior work. The algorithm runs an online learning procedure with designed rewards to construct an exploration policy. Data gathered with that policy supports accurate dynamics estimation and, once the reward is revealed, computation of an ε-optimal policy. The paper also establishes a tight lower bound for reward-free exploration, closing the gap between existing upper and lower bounds.

Key Contribution

Proposes a reward-agnostic exploration algorithm that significantly relaxes the requirement on the accuracy parameter ε imposed by prior work, and establishes a tight lower bound for reward-free exploration.

Abstract

We study reward-free and reward-agnostic exploration in episodic finite-horizon Markov decision processes (MDPs), where an agent explores an unknown environment without observing external rewards. Reward-free exploration aims to enable $ε$-optimal policies for any reward revealed after exploration, while reward-agnostic exploration targets $ε$-optimality for rewards drawn from a small finite class. In the reward-agnostic setting, Li, Yan, Chen, and Fan achieve minimax sample complexity, but only for restrictively small accuracy parameter $ε$. We propose a new algorithm that significantly relaxes the requirement on $ε$. Our approach is novel and of technical interest by itself. Our algorithm employs an online learning procedure with carefully designed rewards to construct an exploration policy, which is used to gather data sufficient for accurate dynamics estimation and subsequent computation of an $ε$-optimal policy once the reward is revealed. Finally, we establish a tight lower bound for reward-free exploration, closing the gap between known upper and lower bounds.
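The final stage described in the abstract (estimate the dynamics from exploration data, then compute a policy once the reward is revealed) can be illustrated with a minimal sketch. This is not the paper's algorithm: the online-learning procedure that builds the exploration policy is omitted, and the function names `estimate_dynamics` and `plan`, the tabular `(h, s, a, s2)` sample format, and the uniform fallback for unvisited state-action pairs are all assumptions made for illustration.

```python
def estimate_dynamics(transitions, S, A, H):
    """Empirical model P_hat[h][s][a][s2] from (h, s, a, s2) samples."""
    counts = [[[[0] * S for _ in range(A)] for _ in range(S)] for _ in range(H)]
    for h, s, a, s2 in transitions:
        counts[h][s][a][s2] += 1
    P_hat = [[[None] * A for _ in range(S)] for _ in range(H)]
    for h in range(H):
        for s in range(S):
            for a in range(A):
                n = sum(counts[h][s][a])
                # Assumption: an unvisited (h, s, a) falls back to a uniform row.
                P_hat[h][s][a] = (
                    [c / n for c in counts[h][s][a]] if n else [1.0 / S] * S
                )
    return P_hat


def plan(P_hat, reward):
    """Backward induction on the estimated model; reward[h][s][a] is revealed late."""
    H, S, A = len(P_hat), len(P_hat[0]), len(P_hat[0][0])
    V = [[0.0] * S for _ in range(H + 1)]  # V[H] is the terminal value, all zeros
    policy = [[0] * S for _ in range(H)]
    for h in reversed(range(H)):
        for s in range(S):
            # Q-value of each action: immediate reward plus expected next value.
            q = [
                reward[h][s][a]
                + sum(p * v for p, v in zip(P_hat[h][s][a], V[h + 1]))
                for a in range(A)
            ]
            policy[h][s] = max(range(A), key=q.__getitem__)
            V[h][s] = q[policy[h][s]]
    return policy, V
```

The same estimated model `P_hat` can be reused with any reward passed to `plan`, which mirrors the setting in which exploration happens before the reward is known. How close the resulting policy is to optimal depends on how well the exploration data covers the MDP, which is the part the paper's exploration policy addresses.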

Robotics & Embodied AI · World Models & Planning

Citation Metrics

Citations: 0

Influential citations: 0

References: 0

Year: 2026

Venue: N/A
