Search papers, labs, and topics across Lattice.
This paper studies online learning in Tree Markov Decision Problems (T-MDPs) by framing policy selection as a bandit problem. The key innovation lies in designing confidence bounds that allow bandit algorithms like LUCB and UCB to be implemented with polynomial memory and computation, despite the exponential number of policies. The approach yields instance-dependent bounds on sample complexity and regret, outperforming existing methods in hidden-information games.
Exponentially many policies in Tree MDPs don't have to mean exponential computation: clever confidence bounds let you treat policy selection as a tractable bandit problem.
A Tree Markov Decision Problem (T-MDP) is a finite-horizon MDP with a starting state $s_{1}$, in which every state is reachable from $s_{1}$ through exactly one state-action trajectory. T-MDPs arise naturally as abstractions of decision making in sequential games with perfect recall, against stationary opponents. We consider the problem of on-line learning in T-MDPs, both in the PAC and the regret-minimisation regimes. We show that well-known bandit algorithms -- \textsc{Lucb} and \textsc{Ucb} -- can be applied on T-MDPs by treating each policy as an arm. The apparent technical challenge in this approach is that the number of policies is exponential in the number of states. Our main innovation is in the design of confidence bounds based on data shared by the policies, so that the bandit algorithms can yet be implemented with polynomial memory and per-step computation. We obtain instance-dependent upper bounds on sample complexity and regret that sum a ``gap term'' from every terminal state, rather than every policy. Empirically, our algorithms consistently outperform available alternatives on a suite of hidden-information games.