This work is supported by the French National Research Agency (ANR), under the project BADASS (grant ANR-16-CE40-0002), by the French Ministry of Higher Education and Research (MENESR) and ENS Paris-Saclay. Thanks to Odalric-Ambrym Maillard and Florian Strub at Inria Lille for useful discussions.

Abstract

An online reinforcement learning algorithm is anytime if it does not need to know in advance the horizon T of the experiment. A well-known technique to obtain an anytime algorithm from any non-anytime algorithm is the “Doubling Trick”. In the contexts of adversarial or stochastic multi-armed bandits, the performance of an algorithm is measured by its regret, and we study two families of sequences of growing horizons (geometric and exponential) to generalize previously known results that certain doubling tricks can be used to conserve certain regret bounds. In a broad setting, we prove that a geometric doubling trick can be used to conserve (adversarial) bounds in R_T = O(√T) but cannot conserve (stochastic) bounds in R_T = O(log T). We give insights as to why exponential doubling tricks may be better, as they conserve bounds in R_T = O(log T), and are close to conserving bounds in R_T = O(√T).

Multi-Armed Bandit (MAB) problems are well-studied sequential decision-making problems in which an agent repeatedly chooses an action (the “arm” of a one-armed bandit) in order to maximize some total reward [22]. Initial motivation for their study came from the modeling of clinical trials, as early as 1933 with the seminal work of [25]. In this example, arms correspond to different treatments with unknown, random effects. Since then, MAB models have proved useful for many more applications, ranging from cognitive radio [14] to online content optimization (news article recommendation [19], online advertising [10], A/B testing [16]) and portfolio optimization [23].

While the number of patients involved in a clinical study (and thus the number of treatments to select) is often decided in advance, in other contexts the total number of decisions to make (the horizon T) is unknown. It may correspond to the total number of visitors of a website optimizing its displays for a certain period of time, or to the number of attempted communications in a smart radio device. Hence in such cases, it is crucial to devise anytime algorithms, that is, algorithms that do not rely on the knowledge of this horizon T to sequentially select arms. A general way to turn any base algorithm into an anytime algorithm is the use of the so-called Doubling Trick, first proposed by [4], which consists in repeatedly running the base algorithm with increasing horizons. Motivated by the frequent use of this technique and the absence of a generic study of its effect on the algorithm’s efficiency, this paper investigates in detail two families of doubling sequences (geometric and exponential), and shows that the former should be avoided for stochastic problems.

More formally, a MAB model is a set of K arms, each arm k being associated with an (unknown) reward stream (Y_{k,t})_{t∈N}. Fix a finite (possibly unknown) horizon T. At each time step t ∈ {1,…,T}, an agent selects an arm A(t) ∈ {1,…,K} and receives as a reward the current value of the associated reward stream, r(t) := Y_{A(t),t}. The agent’s decision strategy (or bandit algorithm) A_T := (A(t), t ∈ {1,…,T}) is such that A(t) can only rely on the past observations A(1), r(1), …, A(t−1), r(t−1), on external randomness, and (possibly) on the knowledge of the horizon T. The objective of the agent is to find an algorithm A that maximizes the expected cumulated reward, where the expectation is taken over the possible randomness used by the algorithm and the possible randomness in the generation of the reward streams. In the oblivious case, in which the reward streams are independent of the algorithm’s choices, this is equivalent to minimizing the regret, defined as

R_T(A_T) := max_{k ∈ {1,…,K}} E[ ∑_{t=1}^{T} (Y_{k,t} − Y_{A(t),t}) ].

This quantity, referred to as pseudo-regret in [7], quantifies the difference between the expected cumulated reward of the best fixed action and that of the strategy A_T. For the general adversarial bandit problem [6], in which the reward streams are arbitrary (picked by an adversary), a worst-case lower bound has been given: for every algorithm, there exist (stochastic) reward streams such that the regret is larger than (1/20)√(KT) [6]. Besides, the EXP3 algorithm has been shown to have a regret of order √(KT log(K)).
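As a toy illustration of this definition, the pseudo-regret of a fixed policy can be estimated by Monte-Carlo simulation on Bernoulli arms. Below is a minimal sketch; the interface and function names are ours, not from the paper, and the uniformly random policy is only a placeholder.

```python
import random

def pseudo_regret(means, policy, T, rng):
    """Monte-Carlo estimate of the pseudo-regret R_T for Bernoulli arms.

    `means` are the (unknown to the policy) arm means; `policy(t)` returns
    the chosen arm index A(t). Illustration only: this policy ignores rewards.
    """
    best = max(means)
    total_reward = 0.0
    for t in range(T):
        k = policy(t)
        total_reward += 1.0 if rng.random() < means[k] else 0.0
    # Empirical version of max_k E[sum_t Y_{k,t}] - E[sum_t Y_{A(t),t}]
    return best * T - total_reward

rng = random.Random(42)
means = [0.2, 0.8]
# A uniformly random policy accumulates regret at rate max(mu) - mean(mu),
# i.e., about 0.3 per step here, hence roughly 300 over T = 1000 steps.
uniform = lambda t: rng.randrange(len(means))
r = pseudo_regret(means, uniform, 1000, rng)
```

Playing the best arm at every step would, in contrast, give a pseudo-regret fluctuating around 0, which is the behavior good bandit algorithms approach.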

Much smaller regret may be obtained in stochastic MAB models, in which the reward stream of each arm k is assumed to be i.i.d., from some (unknown) distribution ν_k with mean μ_k. In that case, various algorithms have been proposed with problem-dependent regret upper bounds of the form C(ν) log(T), where C(ν) is a constant that only depends on the arms’ distributions. Different assumptions on the arms’ distributions lead to different problem-dependent constants. In particular, under some parametric assumptions (e.g., Gaussian distributions, exponential families), asymptotically optimal algorithms have been proposed and analyzed (e.g., kl-UCB [8] or Thompson sampling [1]), for which the constant C(ν) obtained in the regret upper bound matches exactly that of the lower bound given by [17]. Under the non-parametric assumption that the ν_k are bounded in [0,1], the regret of the UCB1 algorithm [5] is of the above form with C(ν) = 8 × ∑_{k: μ_k < μ*} (μ* − μ_k)^{−1}, where μ* = max_k μ_k is the mean of the best arm. As in this last example, all the available constants C(ν) become very large on “hard” instances, in which some arms are very close to the best arm. On such instances, C(ν) log(T) may be much larger than the worst-case (1/20)√(KT), and distribution-independent guarantees may actually be preferred.

The MOSS algorithm, proposed by [2], is the first stochastic bandit algorithm to enjoy a problem-dependent logarithmic regret while being optimal in a minimax sense, as its regret is proved to be upper bounded by √(KT) for bandit models with rewards in [0,1]. However, the corresponding constant C(ν) is proportional to K/Δ_min, where Δ_min = min_k (μ* − μ_k) is the minimal gap, which worsens the constant of UCB1. Another drawback of MOSS is that it is not anytime. These two shortcomings have been overcome recently, in two different works. On the one hand, the MOSS-anytime algorithm [11] is minimax optimal and anytime, but its problem-dependent regret does not improve on that of MOSS. On the other hand, the kl-UCB++ algorithm [21] is simultaneously minimax optimal and asymptotically optimal (i.e., it has the best problem-dependent constant C(ν)), but it is not anytime. A natural question is thus whether a Doubling Trick could overcome this limitation.

This question is the starting point of our comprehensive study of the Doubling Trick: can a single Doubling Trick be used to preserve both problem-dependent (logarithmic) regret and minimax (square-root) regret? We answer this question in the negative, by showing that two different types of Doubling Trick may actually be needed. In this paper, we investigate how algorithms enjoying regret guarantees of the generic form

∀T ≥ 1,  R_T(A_T) ≤ c T^γ (log T)^δ + o(T^γ (log T)^δ)   (2)

may be turned into an anytime algorithm enjoying similar regret guarantees with an appropriate Doubling Trick. This does not come for free, and we exhibit a “price of Doubling Trick”: a constant factor larger than 1, referred to as a constant multiplicative loss.

Outline The rest of the paper is organized as follows. The Doubling Trick is formally defined in Section 2, along with a generic tool for its analysis. In Section 3, we present upper and lower bounds on the regret of algorithms to which a geometric Doubling Trick is applied. Section 4 investigates regret guarantees that can be obtained for a “faster” exponential Doubling Trick. Experimental results are then reported in Section 5. Complementary elements of proofs are deferred to the appendix.

The Doubling Trick, denoted by DT, is a general procedure to convert a (possibly non-anytime) algorithm into an anytime algorithm. It is formally stated in Figure 1 below, and depends on a non-decreasing, diverging doubling sequence (T_i)_{i∈N} (i.e., T_i → ∞ as i → ∞). DT fully restarts the underlying algorithm A at the beginning of each new sequence (at t = T_i + 1), and runs it on a sequence of length T_i − T_{i−1}.

Figure 1: The Generic Doubling Trick Algorithm, A′=DT(A,(Ti)i∈N).
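The restart loop of Figure 1 can be sketched in a few lines. Below is a minimal Python sketch, where the make_algo/choose/update interface and the trivial base algorithm are our own illustrative assumptions, not the paper's notation.

```python
class ConstantAlgo:
    """Trivial base 'algorithm' that always plays arm 0 (placeholder for A)."""
    def __init__(self, horizon):
        self.horizon = horizon  # a non-anytime A may tune itself to this value
    def choose(self):
        return 0
    def update(self, arm, reward):
        pass

def doubling_trick(make_algo, doubling_seq, T, draw_reward):
    """Generic Doubling Trick DT(A, (T_i)): fully restart the base algorithm
    at the start of each sequence and run it until time T_i.

    `make_algo(length)` builds a fresh instance of A tuned for that length,
    `doubling_seq(i)` returns T_i, and `draw_reward(k)` samples a reward.
    """
    choices, restarts = [], 0
    t, prev, i = 0, 0, 0
    while t < T:
        Ti = doubling_seq(i)
        algo = make_algo(Ti - prev)   # full restart of A for this sequence
        restarts += 1
        while t < T and t < Ti:
            k = algo.choose()
            algo.update(k, draw_reward(k))
            choices.append(k)
            t += 1
        prev, i = Ti, i + 1
    return choices, restarts
```

With the doubling sequence T_i = 2^i and T = 10, the base algorithm is restarted once per sequence started before the horizon, i.e., L_T + 1 times, matching the sum in Lemma 1 below.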

Related work. The Doubling Trick is a well-known idea in online learning, which can be traced back to [4]. In the literature, the term Doubling Trick usually refers to the geometric sequence T_i = 2^i, in which the horizon actually doubles, popularized by [9] in the context of adversarial bandits. Specific doubling tricks have also been used for stochastic bandits, for example in the work of [3], which uses the doubling sequence T_i = 2^{2^i} to turn the UCB-R algorithm into an anytime algorithm.

Elements of regret analysis. For a sequence (T_i)_{i∈N}, with T_i ∈ N for all i, we denote T_{−1} = 0, and T_0 is always taken non-zero, T_0 > 0 (i.e., T_0 ∈ N*). We only consider non-decreasing, diverging sequences (that is, ∀i, T_{i+1} ≥ T_i, and T_i → ∞ as i → ∞).

Definition 1: Last Term L_T For a non-decreasing diverging sequence (T_i)_{i∈N} and T ∈ N, we define L_T((T_i)_{i∈N}) by

∀T ≥ 1,  L_T((T_i)_{i∈N}) := min{ i ∈ N : T_i > T }.

It is simply denoted L_T when there is no ambiguity (i.e., when the doubling sequence is fixed).
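The last term L_T can be computed directly from this definition. A small sketch for the geometric sequence T_i = ⌊T_0 b^i⌋ (function names ours):

```python
import math

def L_T(T, seq):
    """Index of the last sequence reached at time T:
    L_T := min{ i : T_i > T }, for a non-decreasing diverging sequence."""
    i = 0
    while seq(i) <= T:
        i += 1
    return i

def geometric(T0, b):
    """Geometric doubling sequence T_i = floor(T0 * b^i)."""
    return lambda i: math.floor(T0 * b ** i)
```

For instance, for T_i = 2^i (T_0 = 1, b = 2) and T = 10, L_T = 4 since T_4 = 16 is the first term exceeding 10.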

DT(A) reinitializes its underlying algorithm A at each time T_i, and in general the total regret is upper bounded by the sum of the regrets over the sequences {T_i, …, T_{i+1} − 1}. By considering the last (partial) sequence {T_{L_T−1}, …, T − 1}, this splitting can be used to get a generic upper bound, by accounting for a larger last sequence (up to T_{L_T} − 1). For stochastic bandit models, the i.i.d. hypothesis on the reward streams makes the splitting over the sequences an equality, so we can also get a lower bound by excluding the last partial sequence. Lemma 1 is proved in Appendix A.1.

Lemma 1: Regret Lower and Upper Bounds for DT For any bandit model, algorithm A and horizon T, one has the generic upper bound

R_T(DT(A, (T_i)_{i∈N})) ≤ ∑_{i=0}^{L_T} R_{T_i − T_{i−1}}(A_{T_i − T_{i−1}}).   (UB)(3)

Under a stochastic bandit model, one has furthermore the lower bound

R_T(DT(A, (T_i)_{i∈N})) ≥ ∑_{i=0}^{L_T − 1} R_{T_i − T_{i−1}}(A_{T_i − T_{i−1}}).   (LB)(4)

As one can expect, the key to obtaining regret guarantees for a Doubling Trick algorithm is to choose the doubling sequence (T_i)_{i∈N} correctly. Empirically, one can verify that sequences with slow growth give poor results; for example, an arithmetic progression typically yields linear regret. Building on this observation, we prove that if A satisfies a certain regret bound (R_T = O(T^γ), O((log T)^δ), or O(T^γ (log T)^δ)), then an appropriate anytime version of A with a certain doubling trick conserves the regret bound, with an explicit constant multiplicative loss ℓ > 1. In this paper, we study in depth two families of sequences: first geometric, and then exponential growths.

A geometric doubling sequence makes it possible to conserve a minimax bound (i.e., R_T = O(√T)). It was suggested, for instance, in [9]. We generalize this result in the following theorem, proved in Appendix A.2, by extending it to bounds of the form T^γ (log T)^δ instead of just √T, for any 0 < γ < 1 and δ ≥ 0. Note that no distinction is made for the case δ = 0, neither in the expression of the constant loss nor in the proof.

Theorem 1 If an algorithm A satisfies R_T(A_T) ≤ c T^γ (log T)^δ + f(T), for 0 < γ < 1, δ ≥ 0, c > 0, and an increasing function f(t) = o(t^γ (log t)^δ) (at t → ∞), then the anytime version A′ := DT(A, (T_i)_{i∈N}) with the geometric sequence (T_i)_{i∈N} of parameters T_0 ∈ N*, b > 1 (i.e., T_i = ⌊T_0 b^i⌋), with the condition T_0(b−1) > 1 if δ > 0, satisfies

R_T(A′) ≤ ℓ(γ, δ, T_0, b) c T^γ (log T)^δ + g(T),   (6)

with an increasing function g(t) = o(t^γ (log t)^δ), and a constant loss ℓ(γ, δ, T_0, b) > 1,

ℓ(γ, δ, T_0, b) := ( log(T_0(b−1) + 1) / log(T_0(b−1)) )^δ × b^γ (b−1)^γ / (b^γ − 1).   (7)

For fixed γ and δ, minimizing ℓ(γ, δ, T_0, b) does not always give a unique solution. On the one hand, if γ ≳ 0.384, there is a unique solution b*(γ) > 1 minimizing the b^γ (b−1)^γ / (b^γ − 1) term, solution of b^{γ+1} − 2b + 1 = 0, but without a closed form if γ is unknown. On the other hand, for any γ, the term depending on δ tends quickly to 1 as T_0 increases.

Practical considerations. Empirically, when γ and δ are fixed and known, there is no need to minimize ℓ jointly. It can be minimized separately, by first minimizing b^γ (b−1)^γ / (b^γ − 1), that is, by solving b^{γ+1} − 2b + 1 = 0 numerically (e.g., with Newton’s method), and then by taking T_0 large enough so that the other term is close enough to 1.

For the usual case of γ = 1/2 and δ = 0 (i.e., bounds in √T), the optimal choice of b is (3 + √5)/2 ≈ 2.62, leading to ℓ ≃ 3.33, while the usual choice b = 2 gives ℓ ≃ 3.41 (see Corollary 1 in appendix). Any large enough T_0 gives similar performance, and empirically T_0 ≫ K is preferred, as most algorithms explore each arm once in their first steps (e.g., T_0 = 200 for K = 9 in our experiments).
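These constants are easy to reproduce numerically. Below is a small sketch of the loss of Eq. (7) and of the Newton step suggested in the practical considerations above (function names ours):

```python
import math

def loss_geometric(gamma, delta, T0, b):
    """Constant multiplicative loss l(gamma, delta, T0, b) of Eq. (7)."""
    first = (math.log(T0 * (b - 1) + 1) / math.log(T0 * (b - 1))) ** delta
    return first * (b ** gamma) * ((b - 1) ** gamma) / (b ** gamma - 1)

def best_b(gamma, b=3.0, iters=100):
    """Solve b^(gamma+1) - 2b + 1 = 0 by Newton's method.

    Starting from b = 3 targets the root larger than 1 (b = 1 is also
    a trivial root of this equation).
    """
    for _ in range(iters):
        f = b ** (gamma + 1) - 2 * b + 1
        df = (gamma + 1) * b ** gamma - 2
        b -= f / df
    return b
```

For γ = 1/2 this recovers b* = (3 + √5)/2 and, with δ = 0 and T_0 = 200, the losses ℓ ≃ 3.33 at b* and ℓ = 2 + √2 ≃ 3.41 at b = 2.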

We observe that the constant loss ℓ in Eq. (7) from Theorem 1 blows up when γ goes to zero, giving the intuition that no geometric doubling trick can be used to preserve a logarithmic bound (i.e., with γ = 0, δ > 0). This is confirmed by the lower bound given below.

Theorem 2 For stochastic models, if A satisfies R_T(A_T) ≥ c (log T)^δ, for c > 0 and δ > 0, then the anytime version A′ := DT(A, (T_i)_{i∈N}) with the geometric sequence (T_i)_{i∈N} of parameters T_0 ∈ N*, b > 1 (i.e., T_i = ⌊T_0 b^i⌋) satisfies the following lower bound, for a certain constant c′ > 0,

∀T ≥ 1,  L_T ≥ 2 ⟹ R_T(A′) ≥ c′ (log T)^{δ+1}.   (8)

Moreover, this implies that R_T(A′) = Ω((log T)^{δ+1}), which proves that a geometric sequence cannot be used to conserve a logarithmic regret bound.

If the regret is lower bounded at finite horizon by R_T(A_T) ≥ c √T, it is no surprise that a geometric doubling sequence gives a comparable lower bound for the Doubling Trick algorithm DT(A) (see Theorem 5 in Appendix B). More importantly, we show with Theorem 2 that a geometric sequence cannot be used to conserve a finite-horizon lower bound like R_T(A_T) ≥ c log(T).

This special case (δ = 1) is the most interesting, as efficient algorithms for stationary bandits must match the lower bound from [17]: if R_T(A_T) satisfies both (finite-time) logarithmic lower and upper bounds (i.e., if R_T(A_T)/log(T) is bounded for T large enough), then using a geometric sequence in the doubling trick is a bad idea, as it guarantees a blow-up in the regret lower bound, by implying R_T(DT(A, (T_i)_{i∈N})) = Ω((log T)²). This result is the reason why we need to study successive horizons growing faster than a geometric sequence (i.e., such that log(T_i) ≫ i), like the exponential sequences studied in Section 4.

Let x_i := T_0(b−1)b^i > 0. If T_0(b−1) > 1 (see (♣) below for a discussion of the other case), then Lemma 5 gives log(x_i − 1) ≥ [log(T_0(b−1) − 1) / log(T_0(b−1))] log(x_i), as x_i > 1. For lower bounds, there is no need to handle the constants tightly, and x_i ≥ b^i by hypothesis, so let us call this constant c′ := c (log(T_0(b−1) − 1) / log(T_0(b−1)))^δ > 0, and the bound simplifies to

R_T(A′) ≥ c′ ∑_{i=0}^{L_T − 2} (log(b^i))^δ.

A sum–integral minoration for the increasing function t ↦ t^δ (as δ > 0) gives ∑_{i=0}^{L_T − 2} (log(b^i))^δ = (log b)^δ ∑_{i=1}^{L_T − 2} i^δ ≥ (log b)^δ ∫_0^{L_T − 2} t^δ dt = (log b)^δ / (δ+1) × (L_T − 2)^{δ+1} (if L_T ≥ 2), and so

R_T(A′) ≥ c′ (log b)^δ / (δ+1) × (L_T − 2)^{δ+1}.

For the geometric sequence, we know that L_T ≥ log_b(T/T_0), and log_b(T/T_0) − 2 ∼ log_b(T) as T → ∞, so there exists a constant 0 < c′′ < 1 such that L_T − 2 ≥ c′′ log_b(T) for T large enough, and such that L_T ≥ 2. Thus we just proved that there is a constant c′′′ > 0 such that

R_T(A′) ≥ c′′′ (log T)^{δ+1} =: g(T).

So for T large enough, R_T(A′) ≥ g(T) with g(T) = Θ((log T)^{δ+1}), hence R_T(A′) = Ω((log T)^{δ+1}), which implies that R_T(A′) cannot be a O((log T)^δ).

(♣) If we do not have the hypothesis T_0(b−1) > 1, the same proof goes through, by observing that for i ≥ i_0 large enough we have x_i ≥ b^{i − i_0} (as soon as b^{i_0} ≥ 1/(T_0(b−1)) > 0, i.e., i_0 ≥ ⌈−log_b(T_0(b−1))⌉ ≥ 1), and so the same arguments can be used to obtain a sum starting from i = i_0 + 1 instead of from i = 1. For a fixed i_0, we also have L_T − 2 − i_0 ≥ c′′ log(T) for a (small enough) constant c′′, and thus we obtain the same result. ■
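The sum–integral comparison at the heart of this proof can be checked numerically. A small sanity-check sketch (function names ours):

```python
import math

def log_power_sum(b, delta, n):
    """Left-hand side of the minoration: sum_{i=0}^{n} (log(b^i))^delta."""
    return sum(math.log(b ** i) ** delta for i in range(n + 1))

def integral_bound(b, delta, n):
    """Right-hand side: (log b)^delta / (delta + 1) * n^(delta + 1)."""
    return math.log(b) ** delta / (delta + 1) * n ** (delta + 1)
```

The inequality log_power_sum ≥ integral_bound holds for every b > 1, δ > 0 and n ≥ 1, since i^δ ≥ ∫_{i−1}^{i} t^δ dt for an increasing integrand.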

We define exponential doubling sequences, and prove that they can be used to conserve bounds in O((log T)^δ), unlike the previously studied geometric sequences. Furthermore, we provide elements showing that they may also conserve bounds in O(T^γ) or O(T^γ (log T)^δ).

Definition 3: Exponential Growth For a, b ∈ R, a, b > 1 and T_0 ∈ N*, let τ := T_0/a > 0; then the sequence defined by T_i := ⌊τ a^{b^i}⌋ is non-decreasing and diverging, and satisfies

∀T < T_0, L_T = 0,  and  ∀T ≥ T_0, L_T = ⌈ log_b( log_a(T/τ) ) ⌉.   (9)

Asymptotically, for i and T → ∞, T_i = O(a^{b^i}) and L_T ∼ log_b(log_a(T/τ)) = O(log log T).
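A quick sketch checking Definition 3 and the closed form (9) for L_T, on the classical example T_i = 2^{2^i} (i.e., T_0 = a = b = 2); function names are ours:

```python
import math

def exp_seq(T0, a, b):
    """Exponential doubling sequence T_i = floor(tau * a^(b^i)), tau = T0/a."""
    tau = T0 / a
    return lambda i: math.floor(tau * a ** (b ** i))

def L_T(T, seq):
    """L_T := min{ i : T_i > T }, computed directly from the definition."""
    i = 0
    while seq(i) <= T:
        i += 1
    return i

seq = exp_seq(T0=2, a=2, b=2)   # T_i = 2^(2^i): 2, 4, 16, 256, 65536, ...
# Closed form (9) for T = 100 and tau = 1: ceil(log_2(log_2(100)))
closed_form = math.ceil(math.log(math.log(100 / 1, 2), 2))
```

The doubly-exponential growth is visible immediately: only L_T + 1 = 4 sequences are needed to cover a horizon of T = 100.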

An exponential doubling sequence makes it possible to conserve a problem-dependent regret bound (i.e., R_T = O(log T)). This was already used by [3] in a particular case, and more recently, for example, by [20]. We generalize this result in the following theorem.

Theorem 3 If an algorithm A satisfies R_T(A_T) ≤ c T^γ (log T)^δ + f(T), for 0 ≤ γ < 1, δ ≥ 0, c > 0, and an increasing function f(t) = o(t^γ (log t)^δ) (at t → ∞), then the anytime version A′ := DT(A, (T_i)_{i∈N}) with the exponential sequence (T_i)_{i∈N} of parameters T_0 ∈ N*, a, b > 1 (i.e., T_i = ⌊(T_0/a) a^{b^i}⌋) satisfies the following inequality,

R_T(A′) ≤ ℓ(γ, δ, T_0, a, b) c T^{bγ} (log T)^δ + g(T),   (10)

with an increasing function g(t) = o(t^{bγ} (log t)^δ), and a constant loss ℓ(γ, δ, T_0, a, b) > 0,

ℓ(γ, δ, T_0, a, b) := (a/T_0)^{(b−1)γ} × b^{2δ}/(b^δ − 1) > 0 if δ > 0,  and  ℓ(γ, δ, T_0, a, b) := 1 + 1/((log a)(log(b^γ))) > 1 if δ = 0.   (11)

This result is as rich as Theorem 1, but for logarithmic bounds (i.e., γ = 0) it gives no loss in the rate, only a constant loss ℓ ≥ 4. It can also be applied to bounds of the generic form R_T = O(T^γ (log T)^δ) (i.e., with γ > 0), but with a significant loss in the rate, from T^γ to T^{bγ}, and a constant loss ℓ > 0. It is important to notice that for γ > 0 the loss ℓ can be made arbitrarily small (by taking a first term T_0 large enough). This observation is encouraging, and leads the authors to think that a tight upper bound could be proved.

Corollary 2 In particular, order-optimal and optimal algorithms for the problem-dependent bound have γ = 0 and δ = 1, as well as f(t) = 0, in which case Theorem 3 gives a simpler bound: for any b > 1, ℓ(γ=0, b) = b²/(b−1) ≥ 4, and so

R_T(A_T) ≤ c log(T) ⟹ R_T(DT(A, (⌊T_0^{b^i}⌋)_{i∈N})) ≤ (b²/(b−1)) c log(T).

Remark: the optimal choice b = 2 gives a constant loss ℓ(γ=0, b=2) = 4, which is twice better than the loss of 8.0625 obtained in [3].
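The optimization behind this remark is elementary and can be checked numerically. A small sketch of the loss b^{2δ}/(b^δ − 1) (function name ours):

```python
def loss_exponential(b, delta=1.0):
    """Constant loss b^(2*delta) / (b^delta - 1) from Theorem 3 (delta > 0,
    case gamma = 0 or a = T0, where the (a/T0) factor disappears)."""
    return b ** (2 * delta) / (b ** delta - 1)
```

Writing u = b^δ, the loss is u²/(u − 1), minimized at u = 2, i.e., at b*(δ) = 2^{1/δ}, with minimal value 4 for any δ > 0; for δ = 1 this is the choice b = 2 of the remark.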

We observe that the constant loss from Theorem 3 blows up when δ → 0 (as b^δ − 1 → 0), and we give below in Theorem 4 another result toward a more precise lower bound.

The lower bound from Theorem 4 below is a second step towards proving or disproving the converse of the lower bound of Theorem 2, for bounds of the form R_T = O(T^γ (log T)^δ) (with or without δ = 0) and an exponential sequence. This result, along with Theorem 3, starts to answer the question of whether an exponential sequence is sufficient to conserve both minimax and problem-dependent guarantees with Doubling Tricks. Theorem 4 is proved in Appendix A.3.

Theorem 4 For stochastic models, if A satisfies R_T(A_T) ≥ c T^γ, for c > 0 and 0 < γ ≤ 1, then the anytime version A′ := DT(A, (T_i)_{i∈N}) with the exponential sequence (T_i)_{i∈N} of parameters T_0 ∈ N*, a > 1, b > 1 (i.e., T_i = ⌊(T_0/a) a^{b^i}⌋) satisfies the following lower bound, for a certain constant c′ > 0,

∀T ≥ 1,  R_T(A′) ≥ c′ (T^{1/b})^γ = c′ T^{γ/b}.

If we could just take b → 1 in the two previous Theorems 3 and 4, both results would match and prove that the exponential doubling trick can indeed be used to conserve minimax regret bounds. But b obviously cannot depend on T, and even if it can be taken as close to 1 as we want, it cannot tend to 1 (it has to be constant as T → ∞).

The first part is denoted g_1(T) := ∑_{i=0}^{L_T} f(T_i) and is dealt with by Lemma 7: the sum of the f(T_i) is a o(∑_{i=0}^{L_T} T_i^γ (log T_i)^δ), as f(t) = o(t^γ (log t)^δ) by hypothesis, and this sum of T_i^γ (log T_i)^δ is proved below to be bounded by T^{bγ} (log T)^δ. So g_1(T) = o(T^{bγ} (log T)^δ). The second part is c (T_0/a)^γ ∑_{i=0}^{L_T} (a^{b^i})^γ (log((T_0/a) a^{b^i}))^δ. Define log_+(x) := max(log(x), 0) ≥ 0, so whether T_0/a ≤ 1 or > 1, we always have log((T_0/a) a^{b^i}) ≤ log_+(T_0/a) + log(a^{b^i}). Then we can use Lemma 4 to distribute the power δ. So (log((T_0/a) a^{b^i}))^δ ≤ (log_+(T_0/a))^δ + (log a)^δ (b^i)^δ, with the convention that 0^δ = 0 (even if δ = 0), and so this gives

R_T(A′) ≤ g_1(T) + c (T_0/a)^γ [ (log_+(T_0/a))^δ ∑_{i=0}^{L_T} (a^{b^i})^γ + (log a)^δ ∑_{i=0}^{L_T} (a^{b^i})^γ (b^i)^δ ].

If γ = 0, the first sum is just L_T + 1 = O(log(log(T))), which can be included in g_1(T) = o((log T)^δ) (still increasing), so only the second sum has to be bounded, and a geometric sum gives ∑_{i=0}^{L_T} (b^i)^δ ≤ (b^δ/(b^δ − 1)) (b^{L_T})^δ. If γ > 0, we can naively bound the first sum by ∑_{i=0}^{L_T} (a^{b^i})^γ ≤ (L_T + 1)(a^{b^{L_T}})^γ. Observe that a^{b^{L_T}} = (a^{b^{L_T − 1}})^b ≤ (aT/T_0)^b. So a^{b^{L_T}} = O(T^b) and L_T + 1 = O(log(log(T))), thus the first sum is a O(T^{bγ} log(log(T))) = o(T^{bγ} (log T)^δ) (as δ > 0). In both cases, the first sum can be included in g_2(T), which is still a o(T^{bγ} (log T)^δ). Another geometric sum bounds the second sum: ∑_{i=0}^{L_T} (a^{b^i})^γ (b^i)^δ ≤ (a^{b^{L_T}})^γ ∑_{i=0}^{L_T} (b^i)^δ ≤ (a^{b^{L_T}})^γ (b^δ/(b^δ − 1)) (b^{L_T})^δ.

R_T(A′) ≤ g_2(T) + c_1 (a^{b^{L_T}})^γ (b^{L_T})^δ.

We identify a constant multiplicative loss c_1 := c (T_0/a)^γ (b^δ/(b^δ − 1)) (log a)^δ > 0. The only term left which depends on L_T is (a^{b^{L_T}})^γ (b^{L_T})^δ, and it can be bounded by using b^{L_T} = b · b^{L_T − 1} ≤ b log_a(aT/T_0) = b + b log_a^+(T/T_0) ≤ b + b log_a(T) (as T ≥ 1), and again a^{b^{L_T}} ≤ (aT/T_0)^b. The constant part of b^{L_T} also gives a O(T^{bγ}) term, which can be included in g(T) := g_2(T) + (aT/T_0)^{bγ}, still a o(T^{bγ} (log T)^δ), and still increasing as a sum of increasing functions. So we can focus on the last term, and we obtain

R_T(A′) ≤ g(T) + c_1 (b/log(a))^δ [ (a/T_0)^b T^b ]^γ (log T)^δ ⟹ R_T(A′) ≤ g(T) + ℓ(γ, δ, T_0, a, b) c T^{bγ} (log T)^δ, with an increasing g(t) = o(t^{bγ} (log t)^δ).

So the constant multiplicative loss ℓ depends on γ and δ, as well as on T_0, a and b, and is

ℓ(γ, δ, T_0, a, b) := (a/T_0)^{(b−1)γ} × b^{2δ}/(b^δ − 1) > 0,  if δ > 0.

If T_0 = a, the loss ℓ(γ, δ, T_0, a, b) is minimal at b*(δ) = 2^{1/δ} > 1, for a minimal value min_{b>1} ℓ(γ, δ, T_0, a, b) = 4 (for any δ and γ). Finally, the (a/T_0)^{(b−1)γ} part tends to 0 when T_0 → ∞, so the loss can be made as small as we want, simply by choosing T_0 large enough (but constant w.r.t. T).

(♠) For the other case, δ = 0, we can start similarly, but instead of naively bounding ∑_{i=0}^{L_T} (a^{b^i})^γ by (L_T + 1)(a^{b^{L_T}})^γ, we use Lemma 3 to get a more subtle bound: ∑_{i=0}^{L_T} (a^{b^i})^γ ≤ a^γ + (1 + 1/((log a)(log b^γ))) (a^{b^{L_T}})^γ. The constant term gets included in g(T), and for the non-constant part, (a^{b^{L_T}})^γ is handled similarly. Finally we obtain the loss announced in Eq. (11) for δ = 0. ■

We illustrate here the practical cost of the Doubling Trick, for two interesting non-anytime algorithms recently proposed in the literature: Approximated Finite-Horizon Gittins indexes, which we refer to as AFHG, by [18] (for Gaussian bandits with known variance), and kl-UCB++ by [21] (for Bernoulli bandits).

We first provide some details on these two algorithms, and then illustrate the behavior of Doubling Tricks applied to these algorithms with different doubling sequences.

We denote by X_k(t) := ∑_{s<t} Y_{A(s),s} 1(A(s)=k) the accumulated reward from arm k, and by N_k(t) := ∑_{s<t} 1(A(s)=k) the number of times arm k was sampled. Both algorithms assume the horizon T to be known. They compute an index I_k^A(t) ∈ R for each arm k ∈ {1,…,K} at each time step t ∈ {1,…,T}, and use the indexes to choose the arm with highest index, i.e., A(t) := argmax_{k∈{1,…,K}} I_k^A(t) (ties are broken uniformly at random).

The algorithm AFHG can be applied to Gaussian bandits with variance V (= 1 in our experiments). Let m(T,t) = T − t + 1 ≥ 1, and let

The algorithm kl-UCB++ can be applied to bounded rewards in [0,1], and in particular to Bernoulli bandits. The binary Kullback–Leibler divergence is kl(x,y) := x log(x/y) + (1−x) log((1−x)/(1−y)) (for 0 < x, y < 1), and let log_+(x) := max(0, log(x)) ≥ 0. Let the function g(n,T) := log_+( (T/(Kn)) (1 + (log_+(T/(Kn)))²) ) for n ≤ T, and finally let
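The quantities kl and g defined above translate directly into code. Since the exact index of kl-UCB++ (Equation 14) is not reproduced in this excerpt, the index below is only a standard kl-UCB-style upper confidence bound computed by bisection, given as an illustrative sketch (function names ours):

```python
import math

def kl(x, y):
    """Binary Kullback-Leibler divergence kl(x, y), for 0 < x, y < 1."""
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def g(n, T, K):
    """Exploration function of kl-UCB++: log+( T/(Kn) (1 + log+(T/(Kn))^2) )."""
    log_plus = lambda x: max(0.0, math.log(x))
    r = T / (K * n)
    return log_plus(r * (1 + log_plus(r) ** 2))

def kl_index(mean, n, T, K, iters=60):
    """Largest q in [mean, 1) with n * kl(mean, q) <= g(n, T, K), by bisection.

    A sketch of a kl-UCB-type index, not the paper's exact Equation 14.
    """
    mean = min(max(mean, 1e-9), 1 - 1e-9)   # keep kl well-defined
    level = g(n, T, K)
    lo, hi = mean, 1.0 - 1e-9
    for _ in range(iters):
        mid = (lo + hi) / 2
        if n * kl(mean, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo
```

As expected for an upper confidence bound, the index shrinks towards the empirical mean as the number of samples n grows, since g(n, T, K) decreases with n.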

We present some results from numerical experiments on Bernoulli and Gaussian bandits; more results are presented in Appendix E. We present results for K = 9 arms and horizon T = 45678 (chosen to ensure that no doubling sequence was lucky enough to have T_{L_T − 1} = T, or too close to it). We ran n = 1000 repetitions of the random experiment, either on the same “easy” bandit problem μ (with evenly spaced means), or on n different random instances μ sampled uniformly in [0,1]^K, and we plot the regret averaged over the n simulations. The black line without markers is the (asymptotic) lower bound ∑_{k≠k*} (kl(μ_k, μ*))^{−1} log T, from [17]. We consider kl-UCB++ for Bernoulli bandits (Figures 2 and 3) and AFHG for Gaussian bandits (Figures 4 and 6).

Each doubling trick algorithm uses the same T_0 = 200 as a first guess for the horizon. We include both the non-anytime version that knows the horizon T, and different anytime versions, to compare the choices of doubling sequence. To compare against an algorithm that does not need the horizon, we also include kl-UCB by [8] as a baseline for Bernoulli problems, and UCB by [5] (knowing the variance V = 1) for Gaussian problems. The doubling sequences we consider are a geometric one with b = 2, and different exponential sequences: the “classical” one with b = 2 and a “slow” one with b = 1.1, both with a = T_0 = 200, and a last one with a = 2, b = 2. Despite what Theorem 3 suggests, using a = T_0 with a large enough T_0 empirically improves over using a = 2 with its leading (a/T_0)^{(b−1)γ} factor.
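For concreteness, the four doubling sequences described above can be built as follows, together with the number of restarts each one triggers before the horizon; a sketch with our own function names:

```python
import math

T0, HORIZON = 200, 45678

def geometric(b):
    """Geometric doubling sequence T_i = floor(T0 * b^i)."""
    return lambda i: math.floor(T0 * b ** i)

def exponential(a, b):
    """Exponential doubling sequence T_i = floor((T0/a) * a^(b^i))."""
    return lambda i: math.floor((T0 / a) * a ** (b ** i))

sequences = {
    "geometric b=2":           geometric(2),
    "exponential a=T0, b=2":   exponential(T0, 2),
    "exponential a=T0, b=1.1": exponential(T0, 1.1),
    "exponential a=2, b=2":    exponential(2, 2),
}

def n_restarts(seq, T):
    """Number of sequences started before the horizon T, i.e., L_T + 1."""
    i = 0
    while seq(i) <= T:
        i += 1
    return i + 1
```

All four sequences start at T_0 = 200, but they reach the horizon T = 45678 after very different numbers of restarts, which is what drives the difference in regret observed in the experiments.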

Another version of the Doubling Trick with “no restart”, denoted DT_no-restart, is presented in Appendix C, but it is only a heuristic and cannot be applied to every algorithm A. The algorithm of Figure 5 can be applied to kl-UCB++ or AFHG, for instance, as they use T just as a numerical parameter (see Equations 13 and 14), but a first limitation is that it cannot be applied to DMED+ [13] or EXP3++ [24], or to any algorithm based on arm eliminations, for example. A second limitation is the difficulty of analyzing this “no restart” variant, due to the unpredictable effect on the regret of giving non-uniform prior information to the underlying algorithm A on each successive sequence. An interesting direction for future work would be to analyze it, either in general or for a specific algorithm like kl-UCB++. Despite its limitations, this heuristic exhibits, as expected, better empirical performance than DT, as can be observed in Appendix E.

We formalized and studied the well-known “Doubling Trick” for generic multi-armed bandit problems, which is used to automatically obtain an anytime algorithm from any non-anytime algorithm. Our results are summarized in Table ?. We showed that a geometric doubling can be used to conserve minimax regret bounds (in √T), with a constant loss (typically ≥ 3.33), but cannot be used to conserve problem-dependent bounds (in log T), for which a faster doubling sequence is needed. An exponential doubling sequence can conserve logarithmic regret bounds, also with a constant loss, but it remains an open question whether minimax bounds can be conserved with this faster-growing sequence. Partial results, both a lower and an upper bound for bounds of the generic form T^γ (log T)^δ, lead us to believe in a positive answer.

It is still an open problem whether an anytime algorithm can be both asymptotically optimal for the problem-dependent regret (i.e., with the exact constant) and minimax optimal (i.e., have a √(KT) regret), but we believe that using a doubling trick on non-anytime algorithms like kl-UCB++ cannot be the solution. We showed that it cannot work with a geometric doubling sequence, and we conjecture that an exponential doubling trick would not bring the right constant either.