Computer Science > Machine Learning

Title:Weighted bandits or: How bandits learn distorted values that are not expected

Abstract: Motivated by models of human decision making proposed to explain commonly
observed deviations from conventional expected value preferences, we formulate
two stochastic multi-armed bandit problems with distorted probabilities on the
cost distributions: the classic $K$-armed bandit and the linearly parameterized
bandit. In both settings, we propose algorithms that are inspired by Upper
Confidence Bound (UCB), incorporate cost distortions, and exhibit sublinear
regret assuming \holder continuous weight distortion functions. For the
$K$-armed setting, we show that the algorithm, called W-UCB, achieves
problem-dependent regret $O(L^2 M^2 \log n/ Δ^{\frac{2}α-1})$,
where $n$ is the number of plays, $Δ$ is the gap in distorted expected
value between the best and next best arm, $L$ and $α$ are the Hölder
constants for the distortion function, and $M$ is an upper bound on costs, and
a problem-independent regret bound of
$O((KL^2M^2)^{α/2}n^{(2-α)/2})$. We also present a matching lower
bound on the regret, showing that the regret of W-UCB is essentially
unimprovable over the class of Hölder-continuous weight distortions. For
the linearly parameterized setting, we develop a new algorithm, a variant of
the Optimism in the Face of Uncertainty Linear bandit (OFUL) algorithm called
WOFUL (Weight-distorted OFUL), and show that it has regret $O(d\sqrt{n} \;
\mbox{polylog}(n))$ with high probability, for sub-Gaussian cost distributions.
Finally, numerical examples demonstrate the advantages resulting from using
distortion-aware learning algorithms.

Comments:

Longer version of the paper to be published as part of the proceedings of AAAI 2017