Search papers, labs, and topics across Lattice.
This paper introduces a unified framework for analyzing the distribution of regret in stochastic multi-armed bandits and episodic reinforcement learning. It formalizes distributional regret bounds as probabilistic guarantees uniform across confidence levels, offering a comprehensive characterization of regret distribution. The authors propose a UCBVI-style algorithm with a novel exploration bonus and derive general gap-independent and gap-dependent distributional regret bounds, achieving optimal trade-offs between expected and distributional regret.
Distributional regret bounds, which quantify the probability of exceeding different regret levels, are now achievable with a UCBVI-style algorithm, confirming a long-standing conjecture for multi-armed bandits.
We study the distribution of regret in stochastic multi-armed bandits and episodic reinforcement learning through a unified framework. We formalize a distributional regret bound as a probabilistic guarantee that holds uniformly over all confidence levels $未\in (0,1]$, thereby characterizing the regret distribution across the full range of $未$. We present a simple UCBVI-style algorithm with exploration bonus $\min\{c_{1,k}/N, c_{2,k}/\sqrt{N}\}$, where $N$ denotes the visit count and $(c_{1,k},c_{2,k})$ are user-specified parameters. For arbitrary parameter sequences, we derive general gap-independent and gap-dependent distributional regret bounds, yielding a principled characterization of how the parameters control the trade-off between expected performance, tail risk, and instance-dependent behavior. In particular, our bounds achieve optimal trade-offs between expected and distributional regret in both minimax and instance-dependent regimes. As a special case, for multi-armed bandits with $A$ arms and horizon $T$, we obtain a distributional regret bound of order $\mathcal{O}(\sqrt{AT}\log(1/未))$, confirming the conjecture of Lattimore & Szepesv谩ri (2020, Section 17.1) for the first time.