EPITAMBZUAIFeb 18, 2026arXiv:2602.16274

Regret and Sample Complexity of Online Q-Learning via Concentration of Stochastic Approximation with Time-Inhomogeneous Markov Chains

Rahul Singh, Siddharth Chandak, Eric Moulines, Vivek S. Borkar, Nicholas Bambos

AI Summary

This paper provides the first high-probability regret bound for classical online Q-learning in infinite-horizon discounted Markov decision processes without using optimism or bonus terms. They analyze Boltzmann Q-learning, showing regret depends on the suboptimality gap, and propose a Smoothed $\epsilon_n$-Greedy exploration scheme with a near-$\tilde{O}(N^{9/10})$ gap-robust regret bound. The analysis relies on a novel high-probability concentration bound for contractive Markovian stochastic approximation with iterate- and time-dependent transition dynamics, where the contraction factor depends on the mixing time.

Key Contribution

Q-learning regret bounds can be achieved without optimism, but are highly sensitive to the suboptimality gap, motivating a new smoothed exploration strategy.

Abstract

We present the first high-probability regret bound for classical online Q-learning in infinite-horizon discounted Markov decision processes, without relying on optimism or bonus terms. We first analyze Boltzmann Q-learning with decaying temperature and show that its regret depends critically on the suboptimality gap of the MDP: for sufficiently large gaps, the regret is sublinear, while for small gaps it deteriorates and can approach linear growth. To address this limitation, we study a Smoothed $ε_n$-Greedy exploration scheme that combines $ε_n$-greedy and Boltzmann exploration, for which we prove a gap-robust regret bound of near-$\tilde{O}(N^{9/10})$. To analyze these algorithms, we develop a high-probability concentration bound for contractive Markovian stochastic approximation with iterate- and time-dependent transition dynamics. This bound may be of independent interest as the contraction factor in our bound is governed by the mixing time and is allowed to converge to one asymptotically.

Training Efficiency & Optimization

Citation Metrics

Citations0

Influential citations0

References0

Year2026

VenueN/A

Related Papers

Finding related papers...

Search

Regret and Sample Complexity of Online Q-Learning via Concentration of Stochastic Approximation with Time-Inhomogeneous Markov Chains

Related Papers