Low Regret Algorithms for Multi-armed Bandits and MDPs

Ambuj Tewari

UC Berkeley

Abstract

While taking actions in an unknown probabilistic environment, one
faces the well-known exploration vs. exploitation trade-off. On the one
hand, we would like to exploit our current knowledge of the environment by
choosing actions that have been rewarding in the past. On the other hand,
we would like to explore actions that have not been taken before, as they
might prove rewarding. The notion of regret measures how much worse we
perform compared to someone who has complete knowledge of the
environment. Achieving low regret across a rich class of environments can
be taken as a sign of successfully managing the exploration vs.
exploitation trade-off.
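For concreteness, regret in this setting is typically formalized as follows (a standard definition; the symbols below are not spelled out in the abstract): if $\mu^{*}$ is the expected reward of the best action and $r_t$ is the reward received at round $t$, then the regret after $T$ rounds is

```latex
R(T) \;=\; T\,\mu^{*} \;-\; \mathbb{E}\!\left[\sum_{t=1}^{T} r_{t}\right].
```

Low regret means $R(T)$ grows slowly in $T$; the algorithms in this talk achieve $R(T) = O(\log T)$.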

In this talk, I will describe algorithms for multi-armed bandit problems
and Markov decision processes (MDPs) that achieve low regret over time.
Even when started with no prior knowledge of the environment they face,
these algorithms incur regret that grows only logarithmically with time.

The 'optimism in the face of uncertainty' heuristic is a common feature of
all these algorithms. After reviewing some known bandit algorithms, I will
show how the heuristic can be used to derive a new algorithm, called
Optimistic Linear Programming (OLP), that achieves low regret in any
irreducible MDP. OLP is closely related to an algorithm proposed by
Burnetas and Katehakis, with four key differences: OLP is simpler; it does
not require knowledge of the supports of the transition probabilities; the
proof of its regret bound is simpler; and its regret bound is only a
constant factor larger than theirs. OLP is also similar in flavor to an
algorithm recently proposed by Auer and Ortner, but OLP is simpler and its
regret bound has a better dependence on the size of the MDP.
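To illustrate the optimism heuristic in its simplest setting, here is a sketch of the UCB1 index policy of Auer, Cesa-Bianchi, and Fischer for Bernoulli bandits, one of the known bandit algorithms in this family (this is an illustrative sketch, not the OLP algorithm of the talk; the function name and parameters are my own):

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run the UCB1 index policy on a Bernoulli bandit.

    arm_means: true success probabilities (unknown to the algorithm;
    used here only to simulate rewards).
    Returns (total reward collected, pull counts per arm).
    """
    rng = random.Random(seed)
    n_arms = len(arm_means)
    counts = [0] * n_arms   # number of pulls of each arm
    sums = [0.0] * n_arms   # total reward observed per arm
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= n_arms:
            arm = t - 1  # initialization: pull each arm once
        else:
            # Optimism in the face of uncertainty: play the arm whose
            # upper confidence bound (empirical mean + exploration bonus)
            # is largest. Rarely tried arms get a large bonus.
            arm = max(
                range(n_arms),
                key=lambda a: sums[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        counts[arm] += 1
        sums[arm] += reward
        total += reward
    return total, counts
```

The exploration bonus shrinks as an arm is pulled more often, so suboptimal arms are eventually played only O(log T) times, which is the source of the logarithmic regret mentioned above.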