This is an umbrella project for several related efforts at Microsoft
Research Silicon Valley that address various Multi-Armed Bandit (MAB)
formulations motivated by web search and ad placement. The MAB problem
is a classical paradigm in Machine Learning in which an online algorithm
chooses from a set of strategies in a sequence of trials so as to
maximize the total payoff of the chosen strategies.

The
name "multi-armed bandits" comes from a whimsical scenario in which a
gambler faces several slot machines, a.k.a. "one-armed bandits", that
look identical at first but produce different expected winnings. The
crucial issue here is the trade-off between acquiring new information (exploration)
and capitalizing on the information available so far (exploitation).
While MAB problems have been studied extensively in Machine
Learning, Operations Research and Economics, many exciting questions remain
open. One aspect that we are particularly interested in
concerns modeling and efficiently using various types of side
information that may be available to the algorithm.
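As a concrete illustration of the exploration/exploitation trade-off, here is a minimal sketch of the classical UCB1 algorithm on Bernoulli arms; the arm means and horizon below are illustrative choices, not taken from any of the papers on this page:

```python
import math
import random

def ucb1(arm_means, horizon, seed=0):
    """Run the classical UCB1 algorithm on Bernoulli arms.

    In each round, pick the arm maximizing
    (empirical mean) + sqrt(2 * ln(t) / pulls),
    so rarely-tried arms keep a large exploration bonus while
    well-estimated, high-payoff arms are exploited.
    """
    rng = random.Random(seed)
    k = len(arm_means)
    pulls = [0] * k
    sums = [0.0] * k
    total = 0.0
    for t in range(1, horizon + 1):
        if t <= k:
            arm = t - 1  # pull each arm once to initialize
        else:
            arm = max(
                range(k),
                key=lambda i: sums[i] / pulls[i]
                + math.sqrt(2.0 * math.log(t) / pulls[i]),
            )
        reward = 1.0 if rng.random() < arm_means[arm] else 0.0
        pulls[arm] += 1
        sums[arm] += reward
        total += reward
    return total, pulls

reward, pulls = ucb1([0.3, 0.5, 0.7], horizon=5000)
# the best arm (index 2) ends up with the large majority of pulls
```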

MAB problems with similarity information

Multi-armed bandits in metric spaces
Robert Kleinberg, Alex Slivkins and Eli Upfal (STOC 2008)
Abstract: We
introduce a version of the stochastic MAB problem, possibly with
a very large set of arms, in which the expected payoffs obey a
Lipschitz condition with respect to a given metric space. The
goal is to minimize regret as a function of time, both in the
worst case and for 'nice' problem instances.
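For intuition, a simple baseline in this setting (which the paper improves upon with adaptive methods) is to discretize the metric space uniformly and run a standard bandit algorithm over the grid points. Below is a minimal sketch on the interval [0,1] with UCB1 over the grid; the payoff function and parameters are illustrative:

```python
import math
import random

def uniform_grid_bandit(mu, horizon, num_arms, seed=0):
    """Baseline for Lipschitz MAB on [0,1]: place a uniform grid of
    arms and run UCB1 over the grid points. The grid spacing trades
    off discretization error against the number of arms; adaptive
    refinement of high-payoff regions improves on this."""
    rng = random.Random(seed)
    grid = [i / (num_arms - 1) for i in range(num_arms)]
    pulls = [0] * num_arms
    sums = [0.0] * num_arms
    for t in range(1, horizon + 1):
        if t <= num_arms:
            i = t - 1  # initialize: pull each grid arm once
        else:
            i = max(range(num_arms),
                    key=lambda j: sums[j] / pulls[j]
                    + math.sqrt(2.0 * math.log(t) / pulls[j]))
        # stochastic payoff: Bernoulli with Lipschitz mean mu(x)
        reward = 1.0 if rng.random() < mu(grid[i]) else 0.0
        pulls[i] += 1
        sums[i] += reward
    best = max(range(num_arms), key=lambda j: pulls[j])
    return grid[best]

# illustrative 1-Lipschitz payoff function peaked at x = 0.6
mu = lambda x: 0.9 - abs(x - 0.6)
x_hat = uniform_grid_bandit(mu, horizon=20000, num_arms=11)
# the most-pulled grid point lands near the peak
```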

Sharp dichotomies for regret minimization in metric spaces
Robert Kleinberg and Alex Slivkins (SODA 2010)
Abstract: We
focus on the connections between online learning and metric
topology. The main result is that the worst-case regret is either O(log
t) or at least of order sqrt(t), depending on whether the completion of the
metric space is compact and countable. We prove a number of other
dichotomy-style results, and extend them to the full-feedback
(experts) version.

Contextual bandits with similarity
information
Alex Slivkins (COLT 2011)
Abstract: In
the 'contextual bandits' setting, in each round nature reveals a
'context' x, the algorithm chooses an 'arm' y, and the expected payoff
is µ(x,y). Similarity information is expressed by a metric space over the
(x,y) pairs such that µ is a Lipschitz function. Our algorithms are
based on adaptive (rather than uniform) partitions of the metric
space which are adjusted to the popular and high-payoff regions.
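A minimal sketch of the uniform-partition baseline (which the paper's adaptive partitions refine): split a one-dimensional context space into bins and run an independent UCB1 instance per bin over a small arm set. The payoff function, bin count, and arm set below are illustrative assumptions:

```python
import math
import random

def contextual_ucb(mu, horizon, num_bins, seed=0):
    """Uniform-partition baseline for Lipschitz contextual bandits:
    partition the context space [0,1] into equal bins and run a
    separate UCB1 instance per bin over a discrete arm set, so each
    bin learns the best arm for its range of contexts."""
    rng = random.Random(seed)
    arms = [0.0, 0.5, 1.0]
    k = len(arms)
    pulls = [[0] * k for _ in range(num_bins)]
    sums = [[0.0] * k for _ in range(num_bins)]
    total = 0.0
    for t in range(1, horizon + 1):
        x = rng.random()                       # context revealed by nature
        b = min(int(x * num_bins), num_bins - 1)
        counts = pulls[b]
        if 0 in counts:
            a = counts.index(0)                # try each arm once per bin
        else:
            n = sum(counts)
            a = max(range(k),
                    key=lambda j: sums[b][j] / counts[j]
                    + math.sqrt(2.0 * math.log(n) / counts[j]))
        reward = 1.0 if rng.random() < mu(x, arms[a]) else 0.0
        pulls[b][a] += 1
        sums[b][a] += reward
        total += reward
    return total / horizon

# illustrative Lipschitz payoff: the best arm tracks the context
mu = lambda x, y: 0.9 - 0.6 * abs(x - y)
avg = contextual_ucb(mu, horizon=30000, num_bins=5)
```

Per-bin learning lets the chosen arm follow the context, beating any single fixed arm; adaptive partitions go further by spending resolution only where contexts are frequent and payoffs high.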

Multi-armed bandits on implicit metric
spaces
Alex Slivkins (NIPS 2011)
Abstract: Suppose
an MAB algorithm is given a tree-based classification of arms. This
tree implicitly defines a "similarity distance" between arms, but
the numeric distances are not revealed to the algorithm. Our
algorithm (almost) matches the best known guarantees for the setting
(Lipschitz MAB) in which the distances are revealed.

MAB problems in a changing environment

Adapting to a stochastically changing
environment
Alex Slivkins and Eli Upfal (COLT 2008)
Abstract: We
study a version of the stochastic MAB problem in which the expected
reward of each arm evolves stochastically and gradually in time,
following an independent Brownian motion or a similar process. Our
benchmark is a hypothetical policy that chooses the best arm in each
round.
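The model can be illustrated by simulating the drift directly. The sketch below (with illustrative parameters) runs each arm's mean as a reflected random walk on [0,1] and counts how often the identity of the best arm changes, which is what makes the per-round best-arm benchmark demanding:

```python
import random

def simulate_drifting_arms(num_arms, horizon, sigma=0.02, seed=0):
    """Simulate the dynamic MAB model: each arm's expected reward
    performs an independent Gaussian random walk on [0,1], reflected
    at the boundaries. Returns how many times the identity of the
    best arm changes over the horizon; each change forces an
    algorithm tracking the per-round benchmark to re-explore."""
    rng = random.Random(seed)
    means = [rng.random() for _ in range(num_arms)]
    leader = max(range(num_arms), key=lambda i: means[i])
    lead_changes = 0
    for _ in range(horizon):
        for i in range(num_arms):
            means[i] += rng.gauss(0.0, sigma)
            # reflect at the boundaries to stay in [0, 1]
            if means[i] < 0.0:
                means[i] = -means[i]
            elif means[i] > 1.0:
                means[i] = 2.0 - means[i]
        new_leader = max(range(num_arms), key=lambda i: means[i])
        if new_leader != leader:
            lead_changes += 1
            leader = new_leader
    return lead_changes

changes = simulate_drifting_arms(num_arms=3, horizon=10000)
# over a long horizon the best arm changes identity many times
```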

Adapting to the Shifting Intent of Search
Queries
Umar Syed, Alex Slivkins and Nina Mishra (NIPS 2009)
Abstract: Query
intent may shift over time. A classifier can use the available
signals to predict a shift in intent. Then a bandit algorithm can be
used to find the new relevant results. We present a meta-algorithm
that combines such a classifier with a bandit algorithm in a feedback
loop.

Contextual bandits with similarity
information
Alex Slivkins (COLT 2011)
Abstract: Interpreting
the current time as a part of the contextual information, we obtain
a very general bandit framework that (in addition to similarity
between arms and contexts) can include slowly changing payoffs and
variable sets of arms.

Explore-exploit tradeoff in mechanism design

Characterizing truthful multi-armed bandit mechanisms
Moshe Babaioff, Alex Slivkins and Yogi Sharma (EC 2009)
Abstract: We
consider a multi-round auction setting motivated by
pay-per-click auctions in Internet advertising, which can be
viewed as a strategic version of the MAB problem. We investigate
how the design of MAB algorithms is affected by the restriction
of truthfulness. We show striking differences in terms of both
the structure of an algorithm and its regret.

Risk vs. reward trade-off in MAB

Prediction Strategies without loss
Michael Kapralov and Rina Panigrahy (NIPS 2011)
Abstract: We
show that it is theoretically possible to extract some reward in a
bandit prediction game while having an exponentially small downside
risk.