ORF523: Optimization with bandit feedback

In this last lecture we consider the case where one can only access the function with a noisy $0^{th}$-order oracle (see this lecture for a definition). This assumption models the 'physical reality' of many practical problems (in contrast to the case of a $1^{st}$-order oracle, which was essentially used as an 'approximation'). Indeed there are many situations where the function to be optimized is not known, but one can simulate the function. Imagine for example that we are trying to find the correct dosage of a few chemical elements in order to produce the most effective drug. We can imagine that given a specific dosage one can produce the corresponding drug, and then test it to estimate its effectiveness. This corresponds precisely to assuming that we have a noisy $0^{th}$-order oracle for the function $f$ that maps a dosage to the effectiveness of the corresponding drug.

As one can see with the above example, the applications of $0^{th}$-order optimization are fundamentally different from the applications of $1^{st}$-order optimization. In particular in $1^{st}$-order optimization the function $f$ is known (we need to compute gradients), and thus it must have been directly engineered by humans (who wrote down the function!). This observation shows that in these applications one has some flexibility in the choice of $f$, and since we know that (roughly speaking) convex functions can be optimized efficiently, one has a strong incentive to engineer convex functions. On the other hand in $0^{th}$-order optimization the function is produced by 'Nature'. As it turns out, Nature does not care that we can only solve convex problems, and in most applications of $0^{th}$-order optimization I would argue that it does not make sense to make a convexity assumption.

As you can imagine $0^{th}$-order optimization appears in various contexts, from experiments in physics or biology, to the design of Artificial Intelligence for games, and the list goes on. In particular different communities have been looking at this problem, with different terminology and different assumptions. Being completely biased by my work on the multi-armed bandit problem, I believe that the bandit terminology is the nicest one to describe $0^{th}$-order optimization and I will now refer to it as optimization with bandit feedback (for more explanation on this you can check my survey with Cesa-Bianchi on bandit problems). In order to simplify the discussion we will focus in this lecture on optimization over a finite set, that is $|\mathcal{X}| < +\infty$ (we give some references on the continuous case at the end of the post).

Discrete optimization with bandit feedback

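We want to solve the problem:

$$\max_{i = 1, \ldots, K} \mu_i,$$

where the $\mu_i$'s are unknown quantities on which we can obtain information as follows: if one makes a query to action $i$ (we will also say that one pulls arm $i$), then one receives an independent random variable $Y$ such that $\mathbb{E} Y = \mu_i$. We will also assume that the distribution of $Y$ is subgaussian (for example this includes Gaussian distributions with variance less than $1$ and distributions supported on an interval of length less than $2$). The precise definition is not important, we will only use the fact that if one receives $Y_1, \ldots, Y_s$ from pulling arm $i$ a total of $s$ times, then the empirical mean $\hat{\mu}_{i,s} = \frac{1}{s} \sum_{t=1}^s Y_t$ satisfies, for any $u > 0$:

$$\mathbb{P}(\hat{\mu}_{i,s} - \mu_i > u) \leq \exp\left(- \frac{s u^2}{2}\right). \qquad (1)$$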
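
To make (1) concrete, here is a minimal simulation sketch in Python. It assumes Gaussian rewards with unit variance (one of the subgaussian examples mentioned above); the arm mean `mu`, the values of `s` and `u`, and the function names are hypothetical choices made only for illustration.

```python
import math
import random


def empirical_tail(s, u, trials=20000, seed=0):
    """Estimate P(empirical mean of s pulls - mu > u) by simulation."""
    rng = random.Random(seed)
    mu = 0.3  # hypothetical true mean of the arm
    exceed = 0
    for _ in range(trials):
        mean = sum(rng.gauss(mu, 1.0) for _ in range(s)) / s
        if mean - mu > u:
            exceed += 1
    return exceed / trials


def subgaussian_bound(s, u):
    """Right-hand side of (1)."""
    return math.exp(-s * u * u / 2)


s, u = 25, 0.4
freq = empirical_tail(s, u)
bound = subgaussian_bound(s, u)
print(f"empirical tail: {freq:.4f}, bound from (1): {bound:.4f}")
```

The empirical frequency should sit below the bound, which is not tight for Gaussian rewards.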

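For the sake of simplicity we assume that the best arm is unique, that is $\mathrm{argmax}_{1 \leq i \leq K} \mu_i = \{i^*\}$. An important parameter in our analysis is the suboptimality gap for arm $i$: $\Delta_i = \mu_{i^*} - \mu_i$ (we also denote $\Delta = \min_{i \neq i^*} \Delta_i$).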

We will be interested in two very different models:

In the PAC (Probably Approximately Correct) setting the algorithm can make as many queries as necessary, so that when it stops querying it can output an arm $J \in \{1, \ldots, K\}$ such that $\mathbb{P}(\mu_J < \mu_{i^*} - \epsilon) \leq \delta$, where $\epsilon$ and $\delta$ are prespecified accuracies. We denote by $\mathcal{N}$ the number of queries that the algorithm made before stopping, and the objective is to have $(\epsilon, \delta)$-PAC algorithms with $\mathcal{N}$ as small as possible. This formulation goes back to Bechhofer (1954).

In the fixed budget setting the algorithm can make at most $n$ queries, and then it has to output an arm $J \in \{1, \ldots, K\}$. The goal here is to minimize the optimization error $\mu_{i^*} - \mu_J$. Strangely this formulation is much more recent: it was proposed in this paper that I wrote with Munos and Stoltz.

While at first sight the two models might seem similar, we will see that there is in fact a fundamental difference between them.

Trivial algorithms

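A trivial $(\epsilon, \delta)$-PAC algorithm would be to query each arm of order $\frac{1}{\epsilon^2} \log \frac{K}{\delta}$ times and then output the empirical best arm. Using (1) and a union bound over the $K$ arms it is obvious that this algorithm is indeed $(\epsilon, \delta)$-PAC, and furthermore it satisfies $\mathcal{N} \leq c \frac{K}{\epsilon^2} \log \frac{K}{\delta}$ for some numerical constant $c > 0$.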
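
The trivial PAC strategy can be sketched as follows. The explicit constant is my own choice for this illustration, not something fixed by the lecture: with $s$ pulls per arm, applying (1) to both tails (assuming the lower tail obeys the same bound) and a union bound over the arms shows that all empirical means are within $\epsilon/2$ of the true means with probability at least $1 - 2K\exp(-s\epsilon^2/8)$, so $s = \lceil \frac{8}{\epsilon^2} \log \frac{2K}{\delta} \rceil$ suffices. The function `pull` and the arm means below are hypothetical.

```python
import math
import random


def naive_pac(pull, K, eps, delta):
    """Pull every arm the same number of times, return (empirical best arm, total queries)."""
    # Per-arm sample size from the union bound described in the lead-in.
    s = math.ceil(8.0 / eps ** 2 * math.log(2 * K / delta))
    means = [sum(pull(i) for _ in range(s)) / s for i in range(K)]
    best = max(range(K), key=lambda i: means[i])
    return best, s * K


# Hypothetical instance: three Gaussian arms with unit variance.
rng = random.Random(1)
mus = [0.9, 0.8, 0.0]
best, n_queries = naive_pac(lambda i: rng.gauss(mus[i], 1.0), K=3, eps=0.5, delta=0.1)
print(best, n_queries)
```

Note that the number of queries depends only on $K$, $\epsilon$ and $\delta$, and not on how far apart the arms actually are.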

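In the fixed budget setting a trivial algorithm is to divide the budget evenly among the $K$ arms and output the empirical best arm. Using (1) it is immediate that this strategy satisfies $\mathbb{P}(\mu_{i^*} - \mu_J \geq \epsilon) \leq K \exp\left(- c \frac{n}{K} \epsilon^2\right)$, which equivalently states that to have $\mathbb{P}(\mu_{i^*} - \mu_J \geq \epsilon) \leq \delta$ one needs the budget $n$ to be at least of order $\frac{K}{\epsilon^2} \log \frac{K}{\delta}$.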
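
A minimal sketch of the uniform allocation strategy is given below. The bound `uniform_error_bound` uses one explicit choice of constants, $2K\exp(-\lfloor n/K \rfloor \epsilon^2/8)$, obtained from (1) applied to both tails and a union bound; these constants, the function names and the test instance are assumptions of this illustration.

```python
import math
import random


def uniform_allocation(pull, K, n):
    """Spend floor(n / K) pulls on each arm and return the empirical best arm."""
    s = n // K
    means = [sum(pull(i) for _ in range(s)) / s for i in range(K)]
    return max(range(K), key=lambda i: means[i])


def uniform_error_bound(K, n, eps):
    """One explicit bound on the probability that the suboptimality of the output exceeds eps."""
    return min(1.0, 2 * K * math.exp(-(n // K) * eps ** 2 / 8))


# Hypothetical instance: three Gaussian arms with unit variance, arm 1 is best.
rng = random.Random(2)
mus = [0.0, 1.0, 0.2]
J = uniform_allocation(lambda i: rng.gauss(mus[i], 1.0), K=3, n=300)
bound = uniform_error_bound(K=3, n=300, eps=0.8)
print(J, bound)
```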

We will now see that these trivial algorithms can be dramatically improved by taking into account the potential heterogeneity in the $\mu_i$'s. For the sake of simplicity we focus now on finding the best arm $i^*$, that is in the PAC setting we take $\epsilon = 0$, and in the fixed budget setting we consider $\mathbb{P}(J \neq i^*)$. The critical quantity will be the hardness measure:

$$H = \sum_{i \neq i^*} \frac{1}{\Delta_i^2}.$$
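
To see what heterogeneity buys, the following sketch computes the hardness measure $H = \sum_{i \neq i^*} \Delta_i^{-2}$ on two hypothetical instances that share the same number of arms and the same smallest gap. The instances and the function name are made up for illustration.

```python
def hardness(mus):
    """Sum of inverse squared gaps over all arms except the (unique) best one."""
    best = max(mus)
    i_star = mus.index(best)
    return sum(1.0 / (best - m) ** 2 for i, m in enumerate(mus) if i != i_star)


# Both instances have K = 4 arms and smallest gap 0.1, so K / Delta^2 = 400 for both.
H_a = hardness([1.0, 0.9, 0.9, 0.9])  # all suboptimal arms are close to the best
H_b = hardness([1.0, 0.9, 0.0, 0.0])  # only one suboptimal arm is close to the best
print(H_a, H_b)
```

The first instance has $H$ close to $300$ and the second close to $102$, while a bound in terms of $K / \Delta^2$ cannot distinguish them.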

Successive Elimination (SE) for the PAC setting

The Successive Elimination (SE) algorithm was proposed in this paper by Even-Dar, Mannor and Mansour. The idea is very simple: start with a set of active arms $A_1 = \{1, \ldots, K\}$. At each round $t$, pull once each arm in $A_t$. Now construct confidence intervals of size $c \sqrt{\frac{\log (K t / \delta)}{t}}$ for each arm, and build $A_{t+1}$ from $A_t$ by eliminating the arms in $A_t$ whose confidence interval does not overlap with the confidence interval of the currently best empirical arm in $A_t$. The algorithm stops when $|A_t| = 1$, and it outputs the single element of $A_t$. Using a union bound it is an easy exercise to verify that this algorithm is $(0, \delta)$-PAC, and furthermore with probability at least $1 - \delta$ one has $\mathcal{N} \leq c H \log \frac{K}{\delta \Delta}$.
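
Here is a minimal sketch of Successive Elimination along the lines described above. The confidence radius $\sqrt{2 \log(4 K t^2 / \delta) / t}$ is one explicit choice that makes the union bound over arms and rounds sum to at most $\delta$ under (1) applied to both tails; it is my assumption, not necessarily the constant used in the paper. The `max_rounds` cap is a hypothetical safety guard for this illustration.

```python
import math
import random


def successive_elimination(pull, K, delta, max_rounds=100000):
    """Return (estimated best arm, number of queries used)."""
    active = list(range(K))
    sums = [0.0] * K
    n_queries = 0
    t = 0
    while len(active) > 1 and t < max_rounds:
        t += 1
        # Pull each active arm once; every active arm has exactly t pulls.
        for i in active:
            sums[i] += pull(i)
            n_queries += 1
        radius = math.sqrt(2 * math.log(4 * K * t * t / delta) / t)
        best_mean = max(sums[i] / t for i in active)
        # Keep an arm only if its interval overlaps the interval of the empirical best.
        active = [i for i in active if sums[i] / t + radius >= best_mean - radius]
    # Active arms share the same pull count, so comparing sums compares means.
    return max(active, key=lambda i: sums[i]), n_queries


# Hypothetical instance: three Gaussian arms with unit variance, arm 0 is best.
rng = random.Random(3)
mus = [1.0, 0.0, 0.5]
best, n_queries = successive_elimination(lambda i: rng.gauss(mus[i], 1.0), K=3, delta=0.1)
print(best, n_queries)
```

Arms with a large gap are dropped after few rounds, which is where the dependence on $H$ rather than $K / \Delta^2$ comes from.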

Successive Rejects (SR) for the fixed budget setting

The Successive Elimination (SE) algorithm cannot be easily adapted to the fixed budget setting. The reason is that in the fixed budget framework we do not know a priori what is a reasonable value for the confidence parameter $\delta$. Indeed in the end an optimal algorithm should have a probability of error of order $\exp(- c n / H)$, which depends on the unknown hardness parameter $H$. As a result one cannot use a strategy based on confidence intervals in the fixed budget setting unless one knows $H$ (note that estimating $H$ from data is basically as hard as finding the best arm). With Audibert and Munos we proposed an alternative to SE for the fixed budget setting that we called Successive Rejects.

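The idea is to divide the budget into $K-1$ chunks $n_1, \ldots, n_{K-1}$ such that $n_1 + \ldots + n_{K-1} = n$. The algorithm then runs in $K-1$ phases. Let $A_k$ be the set of active arms in phase $k$ (with $A_1 = \{1, \ldots, K\}$). In phase $k$ each arm in $A_k$ is sampled $n_k - n_{k-1}$ times (with $n_0 = 0$), so that it has been pulled $n_k$ times in total, and at the end of the phase the arm $i$ with the worst empirical performance in $A_k$ is rejected, that is $A_{k+1} = A_k \setminus \{i\}$. The output of the algorithm is the unique element of $A_K$. Strictly speaking this uses $n + n_{K-1} \leq 2n$ pulls in total, since the last two arms are both sampled $n_{K-1}$ times; rescaling the $n_k$'s by a constant repairs this without affecting the rates below.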
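
The phases can be sketched in Python as follows. To stay within the budget exactly, this sketch uses the normalization $n_k = \lceil \frac{1}{\overline{\log}(K)} \frac{n - K}{K + 1 - k} \rceil$ with $\overline{\log}(K) = \frac{1}{2} + \sum_{i=2}^K \frac{1}{i}$, as I recall it from the Successive Rejects paper; treat it as an assumption of this illustration. The test instance is hypothetical.

```python
import math
import random


def successive_rejects(pull, K, n):
    """Return (estimated best arm, number of pulls used), using at most n pulls."""
    if n < K:
        raise ValueError("the budget n must be at least the number of arms K")
    log_bar = 0.5 + sum(1.0 / i for i in range(2, K + 1))
    active = list(range(K))
    sums = [0.0] * K
    n_prev = 0  # cumulative pulls per active arm after the previous phase
    used = 0
    for k in range(1, K):
        n_k = math.ceil((n - K) / (log_bar * (K + 1 - k)))
        # Bring every active arm up to n_k pulls in total.
        for i in active:
            for _ in range(n_k - n_prev):
                sums[i] += pull(i)
                used += 1
        # Active arms share the same pull count, so comparing sums compares means.
        worst = min(active, key=lambda i: sums[i])
        active.remove(worst)
        n_prev = n_k
    return active[0], used


# Hypothetical instance: four Gaussian arms with unit variance, arm 1 is best.
rng = random.Random(4)
mus = [0.2, 1.0, 0.0, 0.5]
J, used = successive_rejects(lambda i: rng.gauss(mus[i], 1.0), K=4, n=2000)
print(J, used)
```

Note that the algorithm takes no confidence parameter as input: the only tuning is the split of the budget across phases.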

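Let $\Delta_{(1)} \leq \ldots \leq \Delta_{(K)}$ be the ordered $\Delta_i$'s. Remark that in phase $k$, one of the $k$ worst arms must still be in the active set. Thus if the best arm is rejected at the end of phase $k$, its empirical mean after $n_k$ pulls is below the empirical mean of one of these $k$ arms, whose gap is at least $\Delta_{(K+1-k)}$. Applying (1) (and its symmetric counterpart) to both arms at deviation $\Delta_{(K+1-k)} / 2$, a trivial union bound gives:

$$\mathbb{P}(J \neq i^*) \leq \sum_{k=1}^{K-1} 2 k \exp\left( - \frac{n_k \Delta_{(K+1-k)}^2}{8} \right).$$

Now the key observation is that by taking $n_k$ proportional to $\frac{1}{K+1-k}$ one obtains a bound of the form $\exp(- \tilde{c} n / H)$, up to a logarithmic factor in $K$. Precisely let $n_k = \alpha \frac{n}{K+1-k}$ where $\alpha$ is such that the $n_k$'s sum to $n$ (morally $\alpha$ is of order $1 / \log K$). Then we have

$$n_k \Delta_{(K+1-k)}^2 = \frac{\alpha n}{(K+1-k) \Delta_{(K+1-k)}^{-2}} \geq \frac{\alpha n}{\max_{2 \leq i \leq K} i \Delta_{(i)}^{-2}} \geq c \frac{n}{\log(K) H},$$

for some numerical constant $c > 0$. Thus we proved that SR satisfies

$$\mathbb{P}(J \neq i^*) \leq K^2 \exp\left( - c \frac{n}{\log(K) H} \right),$$

which equivalently states that SR finds the best arm with probability at least $1 - \delta$ provided that the budget is at least of order $H \log(K) \log(K / \delta)$.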
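
The comparison with $H$ used in the argument above can be checked in two lines. The following is a sketch, assuming the gaps are ordered as $0 = \Delta_{(1)} < \Delta_{(2)} \leq \ldots \leq \Delta_{(K)}$, that $H = \sum_{i \neq i^*} \Delta_i^{-2}$, and that $n_k = \alpha n / (K+1-k)$:

```latex
% For any index i >= 2, the i-1 gaps Delta_(2), ..., Delta_(i) are all at most Delta_(i):
H = \sum_{j=2}^{K} \Delta_{(j)}^{-2}
  \;\geq\; \sum_{j=2}^{i} \Delta_{(j)}^{-2}
  \;\geq\; (i-1)\, \Delta_{(i)}^{-2}
  \;\geq\; \frac{i}{2}\, \Delta_{(i)}^{-2}.
% Hence max_{2 <= i <= K} i Delta_(i)^{-2} <= 2H, and taking i = K+1-k:
n_k \, \Delta_{(K+1-k)}^2
  = \frac{\alpha n}{(K+1-k)\, \Delta_{(K+1-k)}^{-2}}
  \;\geq\; \frac{\alpha n}{2H}.
```

Since $\alpha$ is of order $1 / \log K$, every exponent in the union bound is at least of order $\frac{n}{\log(K) H}$.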

The continuous case

Many questions are still open in the continuous case. As I explained at the beginning of the post, convexity might not be the best assumption from a practical point of view, but it is nonetheless a very natural mathematical problem to try to understand the best rate of convergence in this case. This is still an open problem, and you can see the state of the art for upper bounds in this paper by Agarwal, Foster, Hsu, Kakade and Rakhlin, and for lower bounds in this paper by Shamir. In my opinion a more 'practical' assumption on the function is simply to assume that it is Lipschitz in some metric. The HOO (Hierarchical Optimistic Optimization) algorithm attains interesting performance when the metric is known, see this paper by myself, Munos, Stoltz and Szepesvari (similar results were obtained independently by Kleinberg, Slivkins and Upfal in this paper). Recently progress has been made for the case where the metric is unknown, see this paper by Slivkins, this paper by Munos, and this one by Bull. Finally let me remark that this Lipschitz assumption is very weak, and thus one cannot hope to solve high-dimensional problems with the above algorithms. In fact, these algorithms are designed for low-dimensional problems (say dimension less than $10$ or so). In standard optimization one can solve problems in very large dimension because of the convexity assumption. For $0^{th}$-order optimization I think that we still don't have a natural assumption that would allow us to scale up the algorithms to higher dimensional problems.