arxiv:2205.07704

From Dirichlet to Rubin: Optimistic Exploration in RL without Bonuses

Published on May 16, 2022

Abstract

AI-generated summary: The Bayes-UCBVI algorithm extends Bayes-UCB to reinforcement learning, achieving optimal regret bounds without Bernstein-like bonuses in tabular, episodic Markov decision processes.

We propose the Bayes-UCBVI algorithm for reinforcement learning in tabular, stage-dependent, episodic Markov decision processes: a natural extension of the Bayes-UCB algorithm by Kaufmann et al. (2012) for multi-armed bandits. Our method uses the quantile of a Q-value function posterior as an upper confidence bound on the optimal Q-value function. For Bayes-UCBVI, we prove a regret bound of order O(√(H^3 S A T)), where H is the length of one episode, S is the number of states, A the number of actions, and T the number of episodes; this matches the lower bound of Ω(√(H^3 S A T)) up to poly-log terms in H, S, A, T for large enough T. To the best of our knowledge, this is the first algorithm that obtains an optimal dependence on the horizon H (and S) without the need for an involved Bernstein-like bonus or noise. Crucial to our analysis is a new fine-grained anti-concentration bound for a weighted Dirichlet sum, which may be of independent interest. We then explain how Bayes-UCBVI can easily be extended beyond the tabular setting, exhibiting a strong link between our algorithm and the Bayesian bootstrap (Rubin, 1981).
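
As a rough illustration of the core idea (a sketch, not the paper's exact procedure), the snippet below computes an optimistic Q-value at a single (state, action) pair as an upper quantile of the Q-value posterior induced by a Dirichlet posterior over the next-state distribution. The function name optimistic_q_estimate, the prior weight, the quantile level, and the Monte Carlo approximation of the quantile are all illustrative assumptions rather than the paper's tuned choices.

```python
import numpy as np

def optimistic_q_estimate(counts, next_values, reward,
                          prior=0.5, quantile=0.99, n_samples=2000, rng=None):
    """Quantile of a posterior Q-value at a single (state, action) pair.

    The next-state distribution gets a Dirichlet(counts + prior) posterior;
    the optimistic Q-value is a high quantile of reward + E_{s'}[next_values]
    under that posterior, approximated here by Monte Carlo sampling.
    (All names and constants are illustrative, not the paper's choices.)
    """
    rng = np.random.default_rng() if rng is None else rng
    alphas = np.asarray(counts, dtype=float) + prior
    # Sample candidate transition vectors from the Dirichlet posterior.
    p_samples = rng.dirichlet(alphas, size=n_samples)            # (n_samples, S)
    # Each sample induces one candidate Q-value: reward + expected next value.
    q_samples = reward + p_samples @ np.asarray(next_values, dtype=float)
    # The chosen upper quantile plays the role of a bonus-free upper confidence bound.
    return float(np.quantile(q_samples, quantile))

# Toy example: 3 next states, a handful of observed transitions under (s, a).
counts = np.array([5, 2, 1])             # visit counts of next states
next_values = np.array([1.0, 0.2, 0.0])  # value estimates at the next stage
print(optimistic_q_estimate(counts, next_values, reward=0.3))
```

Sampling transition vectors and taking a quantile of the resulting Q-values is also what ties this view to the Bayesian bootstrap (Rubin, 1981) mentioned in the abstract: each Dirichlet draw is one resampled weighting of the observed transitions.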
