Kernel-based methods for convex bandits, part 3

(This post absolutely requires to have read Part 1 and Part 2.) A key assumption at the end of Part 1 was that, after rescaling space so that the current exponential weights distribution is isotropic, one has

(1)

for some . Effectively this means that the estimated cumulative loss outside of the ball is infinite (recall that is proportional to ). Thus to enforce (1) (at time ) we will actually set the loss estimate to on . The price one pays for this is that the loss estimator is now unbiased only on , which in turn means that we control the cumulative regret with respect to points in only. We believe that in the so-called stochastic setting where the loss sequence is i.i.d. one should be able to prove that the minimizer of the expected loss function remains in at all round (for ) which would imply with the calculations from Part 1 and Part 2 a cumulative regret of order (note: we were not able to prove this and I think it is a great and accessible open problem). However this will certainly not be true in the general as one could imagine an adversary that makes us zoom in on a small portion of space using the first rounds and then move out the optimum very far from this region for the remaining rounds. To solve this issue we introduce a restart condition: essentially if the algorithm believes that the optimum might be outside the region then it restarts (i.e. it plays as if the next round was the first one in a game with time horizon ). We will modify the algorithm so we can ensure that when a restart occurs the algorithm actually had negative regret, and thus in the final regret bound we can ignore everything that happened before the last restart. The definition of the focus region given in (1) will not quite work, and in fact we will construct a region which verifies

Furthermore we will need to take which in turn forces us to take (indeed recall that in Part 1 we showed that the magnitude of the estimated loss is essentially where instead of comes from the above display, and furthermore in Part 2 we explained that with the Gaussian core one needs to take to be times smaller than the predicted value from Part 1) and hence the final regret bound (indeed recall that we got a bound of where the first comes from the fact that we need to take a learning rate times smaller than the optimal learning rate to ensure approximate log-concavity of the exponential weights, and the comes from the relative entropy distance to the optimum at the beginning).

We will use extensively the following result:

Lemma: Let be an isotropic log-concave measure. Then for any such that one has .

Restart condition

The simplest idea to restart goes as follows. Let us consider some (with defined as in (1)). Why is it there? Well there must be some time in the past where we estimated the cumulative regret of to be at least . In particular if this now has a small estimated regret, say smaller than then we can reasonably believe that something weird is going on, for instance the adversary could be trying to move the optimum outside of the current region of focus . Thus our restart condition looks like this: restart if

A key observation (which is the real reason to introduce the above restart condition) is that by convexity of the losses, by the concentration property of Part 2, and by taking the constant to be large enough, we know that the optimum (over all of ) of the “true” cumulative loss is still in at the time when we restart.

Hoping for negative regret

If we catch the adversary’s presumptive attempt to move the optimum out of quickly enough we can hope to get negative regret. Indeed we initially zoomed in on region for a good reason, and thus if we compare ourselves to the point which triggers the restart we have accumulated a lot of negative regret during this zooming procedure (since was performing badly during this time). Mathematically this shows up in the following term which appears in the regret bound calculation, where is the time at which enters the boundary of the focus region,

(2)

Roughly (essentially thanks to the Lemma above) we should expect this quantity to be as small as for , and thus this term divided by (which is then ) could easily compensate the variance term which is . This sounds great, but unfortunately the difference of entropy in (2) does not appear in our current regret bound (because of the telescopic sum of entropies). However it is easy to see that if at time step we had increased the learning rate from to then we would have the term multiplied by in the final regret bound and thus we would only need to have to ensure negative regret (note that compensating the entropy at the beginning is more difficult than compensating the variance term because of our choice of ). Let us make this slightly more formal.

Turning up the learning rate

Let us assume that we update the exponential weights distribution with a time-dependent learning rate as follows: . One can see with the same calculations than in Part 1 that:

In particular if we increase the learning rate at some times by then we have

Using that the normalizing constants are all larger than (this can be proved using the fact that covariance of the exponential weights is at least some ) and assuming that (i.e. ) one gets roughly, with ,

We see that by taking we can guarantee negative regret with respect to any whose first time at which it belonged to the boundary of the focus region is also a time at which we increased the learning rate (i.e., one of the times ). Thus we see that we should take if we want to have a regret of order . Recall also that we can afford to update the learning rate only times. Thus the next idea is to update the focus region only when it is necessary. We will see that we can take (and thus , ) while guaranteeing that the focus region satisfies

As we explained at the beginning of this post this will conclude the proof of the regret bound.

Updating the focus region more carefully

We update the focus region only at times such that, once space is rescaled so that is isotropic,

and in this case we set

Somewhat clearly we won’t make more than updates (since after that many updates the focus region is really tiny) so the only thing to verify is that one always has (whether we update or not)

This follows from:

Lemma: Let be a convex body and be the centered unit ball. Suppose that . Then .

Proof: Let us prove the contrapositive and assume that there is a point with . Denote , and consider the sets .

Note that those sets are disjoint. Indeed, the intervals are disjoint, which implies that the projections of the ellipsoids onto the span of are disjoint. So, we have

which concludes the proof.

That’s it folks!

This concludes the informal proof! In the real proof we unfortunately need to deal with the various places where we said “events of probability do not matter”. There is also the discrepancy between approximately log-concave and exactly log-concave measure. Finally doing the concentration argument properly (with the induction) is one more place that add some unwanted (but relatively minor) technical complications.

Another important part which I completely omitted in these lecture notes is the computational complexity of the resulting algorithm. Using known results on sampling from approximately log-concave measures it is fairly easy to sample in time . Checking whether the focus region has to be updated can also be done in poly time. The only real difficulty is checking the restart condition which asks to minimize an approximately convex function on the boundary of a convex set! The trick here is that one can replace ellipses by boxes, and thus we are left with minimizing an approximately convex function on convex sets (which are dimensional). All of this can be found in the paper of course!