This work is supported by the French National Research Agency (ANR) with the project BADASS, (N ANR-16-CE40-0002), the CNRS with the project PEPS8 BIO, the French Ministry of Research (MENESR) and ENS Paris-Saclay.

Abstract

Multi-armed bandit algorithms have been recently studied and evaluated for Cognitive Radio (CR), especially in the context of Opportunistic Spectrum Access (OSA) [?]. Several solutions have been explored based on various models, but it is hard to exactly predict which could be the best for real-world conditions at every instants. Hence, expert aggregation algorithms can be useful to select on the run the best algorithm for a specific situation. Aggregation algorithms, such as Exp4 dating back from 2002 [?], have never been used for OSA learning, and we show that it appears empirically sub-efficient when applied to simple stochastic problems. In this article, we present an improved variant, called Aggregator. For synthetic OSA problems modeled as Multi-Armed Bandit (MAB) problems, simulation results are presented to demonstrate its empirical efficiency. We combine classical algorithms, such as Thompson sampling, Upper-Confidence Bounds algorithms (UCB and variants), and Bayesian or Kullback-Leibler UCB. Our algorithm offers good performance compared to state-of-the-art algorithms (Exp4, CORRAL or LearnExp[?]), and appears as a robust approach to select on the run the best algorithm for any stochastic MAB problem, being more realistic to real-world radio settings than any tuning-based approach.

Cognitive Radio (CR), introduced in 1999[?], states that a radio, by collecting information about its environment, can dynamically reconfigure itself in order to improve its functionality regarding various metrics. One of the main direction of research, called Dynamic Spectrum Access [?], is focused on the spectrum access when devices reconfigure themselves by simply changing the frequency of their wireless communication. The model of Opportunistic Spectrum Access (OSA) for CR considers one Secondary User (SU) trying to use a licensed radio network occupied by Primary Users (PU). The network usage from the PU determines the availability patterns of the radio channels, and the goal of the SU is to communicate as efficiently as possible, without interfering with the PU. Thus at each step, a SU first senses one channel, and only transmits if this channel is unoccupied by a PU.

A common simple model in the literature is to describe the PU impact on the availability of the K channels in the following way: channels are independent and identically distributed (i.i.d.), and their qualities follow parametric distributions, e.g., Bernoulli of means μ1,…,μK∈[0,1] for availabilities when dealing with binary sensing feedback. The SU has to select the best expected channel each time to maximize its throughput: if successful communications are seen as rewards, the SU has to maximize its cumulative rewards, as in the Multi-Armed Bandit (MAB) problem [?].

MAB learning algorithms have been shown to be useful for the OSA setting [?], and UCB algorithms and other variants (e.g., kl-UCBor Bayes-UCB, [?]) have been successfully applied to both numerical and physically simulated CR problems [?]. The performance of such learning algorithm A can be measured by different criteria. For example, it is common in the bandit literature to study the regret [?]˜RAT=μ∗T−∑Tt=1r(t) which compares the loss in rewards between the algorithm A and the full-knowledge strategy which always picks the best arm, i.e., the most available of mean μ∗. Good algorithms are expected to have slow-growing expected regret, but other criterion include the best arm pull frequency, or the throughput of the SU.

Many different learning algorithms have been proposed by the machine learning community, and most of them depend on several parameters, for instance α for UCB, the prior for Thompson sampling or BayesUCB, the kl function for kl-UCB etc. Every time a new MAB algorithm A is introduced, it is compared and benchmarked on some bandit instance, parameterized by μ=(μ1,…,μK), usually by focusing on its expected regret RT=Eμ[˜RT]. For a known and specific instance, simulations help to select the best algorithm in a pool of algorithms. But when one wants to tackle an unknown real-world problem, one expects to be efficient against any problem, of any kind, size and complexity: ideally one would like to use an algorithm that can be applied identically against any problem. To choose the best algorithm, two approaches can be followed: either extensive benchmarks are done beforehand – if this is possible – to select the algorithm and its optimal parameters, or an adaptive algorithm is used to learn on the fly its parameters. We present a simple adaptive solution, that aggregates several learning algorithms in parallel and adaptively chooses which one to trust the most.

This paper is organized as follows: our OSA model is described in Section 2, and MAB learning algorithms are briefly presented in Section 3. We explain in Section 4 how to combine such algorithms for aggregation. Our proposed algorithm, called Aggregator, is detailed in Section 4.1, with numerical experiments presented in Section 5, comparing the regret of several algorithms against different aggregation algorithms. Theoretical guarantees are shortly discussed in Section 6, and Section 7 concludes.

We consider K≥1 radio channels, also called arms, of different characteristics, unknown to the user. The radio protocol is slotted in both time and frequency, meaning that at each time step t∈N, the Secondary User (SU) tries to communicate in a channel A(t)∈{1,…,K}. In the OSA model, the SU first senses one channel k at a time, and can use it to communicate only if it was sensed free from any PU (they have full priority over the SU).

In the stochastic model considered in this paper, after choosing the arm k, it is assumed that the sensing provides a rewardrk(t), randomly drawn from a certain distribution depending on the arm index. Rewards are assumed to be bounded in [0,1], and generally they follow one-parameter exponential families. We present our algorithm by restricting to Bernoulli distributions1, for sake of simplicity, meaning that arm k has a parameter μk∈[0,1] and rewards are drawn from B(μk), rk(t)∼B(μk), which can be simply interpreted by the SU: it is 1 if the channel k is not used by any PU during the time slot t, and is 0 otherwise. Aggregation algorithms usually deal with losses rather rewards [?], so we also introduce the quantity ℓk(t):=1−rk(t)∈[0,1].

An algorithm A has to maximize its cumulative rewards, by choosing the arm A(t)∈{1,…,K} at time t, or, equivalently, to minimize its pseudo-regret ˜RAT is defined as

˜RAT:=T∑t=1(μ∗−rA(t)(t))=Tμ∗−T∑t=1rA(t)(t),(1)

where μ∗ is the mean of the best arm: μ∗=maxkk=1μk=maxkk=1Eμ[rk(t)], for an expectation taken on the arm distributions. This pseudo-regret is random so we prefer to focus on the expected regret, RAT:=Eμ[˜RAT].

The UCB algorithm [?] selects the arm with highest index, where each index is an Upper Confidence Bound on the unknown mean, computed as the sum of the empirical mean of each arm ^μk(t)=Xk(t)/Nk(t) (if Xk(t):=∑Tt=1rA(t)(t)1(A(t)=k) and Nk(t):=∑Tt=11(A(t)=k)), and an exploration term defined by √αlog(t)/Nk(t). α=1/2 is known to yield logarithmic regret on all problems, but on some specific instance μ a better value of α may be found empirically [?].

The kl-UCB algorithm is similar, but instead it uses a Kullback-Leibler divergence function to compute a statistically better UCB [?]. As a different KL function exists for each different exponential family, this algorithm also requires a prior knowledge of the problem to be efficient.

The Thompson sampling (TS) [?] algorithm is Bayesian: it maintains a posterior distribution on each means (e.g., Beta posteriors for Bernoulli arms), updated after each observation, and chooses an arm by sampling a random mean from each posterior and playing the arm with highest mean. The posterior distribution has to be chosen according to the exponential family as the conjugated posterior.

Both UCB, kl-UCB and TS have been proved to have logarithmic regrets [?], meaning that RAT=O(logT), in Bernoulli bandit problems and also under more general assumptions. The constant in the big-O is important, and [?] showed that in this setting, the regret of any (uniformly efficient) algorithm is at least C(μ)logT when T is large, for a constant C(μ) depending only on the problem instance μ : C(μ)=∑μk≠μ∗(μ∗−μk)/kl(μk,μ∗) (with a unique best arm), where kl(x,b) is the binary KL divergence between two Bernoulli distributions of parameters x and y.

We assume to have N≥2 MAB algorithms, A1,…,AN, and let Aaggr be an aggregation algorithm, which runs the N algorithms in parallel (with the same slotted time), and use them to choose its channels based on a voting from their N decisions. Aaggr depends on a pool of algorithms and a set of parameters. We would like that Aaggr performs almost as well as the best of the Aa, with a good choice of its parameters, independently of the MAB problem. Ideally Aaggr should perform similarly to the best of the Aa. To simplify the presentation, we only aggregate bandit algorithms that give deterministic recommendations: one arm is chosen with probability 1 and the others with probability 0. However, both Exp4 and Aggregator can be adapted to aggregate randomized bandit algorithms, i.e., algorithms that output a probability distribution ξt over the arms {1,…,K} at each time step, and draw the next selected arm according to this distribution.

The aggregation algorithm maintains a probability distribution πt on the N algorithms Aa, starting from a uniform distribution: πta is the probability of trusting the decision made by algorithm Aa at time t. Aaggr then simply performs a weighted vote on its algorithms: it decides whom to trust by sampling a∈{1,…,N} from πt, then follows Aa’s decision. The main questions are then to know what observations (i.e., arms and rewards) should be given as feedback to which algorithms, and how to update the trusts at each step, and our proposal Aggregator differs from Exp4 on these very points.

At every time step, after having observed a loss ℓAt+1(t+1) for its chosen action At+1, the algorithm updates the trust probabilities from πt to πt+1 by a multiplicative exponential factor (using the learning rate and the unbiased loss). Only the algorithms Aa who advised the last decision get their trust updated, in order to trust more the “reliable” algorithms. The loss estimate is unbiased in the following sense. If one had access to the rewards rk(t+1) (or the losses ℓk(t+1)) for all arms k, the loss incurred by algorithm a at time t+1 would be ~ℓa,t+1=ℓAt+1a(t+1). This quantity can only be observed for those algorithms for which At+1a=At+1. However, by dividing by the probability of observing this recommendation, one obtains an unbiased estimate of ~ℓa,t. More precisely,

^ℓa,t+1:=ℓAt+1a(t+1)pt+1At+1a1(At+1a=At+1)

satisfies E[^ℓa,t+1|Ht]=~ℓa,t+1, for all a, where the expectation is taken conditionally to the history of observations up to round t, Ht. Observe that ~ℓa,t+1=ℓt+1 for all algorithms a such that At+1a=At+1, and ~ℓa,t+1=0 otherwise.

An important feature of Aggregator is the feedback provided to each underlying bandit algorithm, upon the observation of arm At+1. Rather than updating only the trusted algorithms (that is the algorithms which would have drawn arm At+1) with the observed reward rAt+1(t+1)=1−ℓAt+1(t+1), we found that updating each algorithm with the (original) loss observed for arm At+1 improves the performance drastically. As expected, the more feedback they get, the faster the underlying algorithms learn, and the better the aggregation algorithm will be [?].

Regarding the update of πt, one can note that the trust probabilities are not all updated before the normalization step, and an alternative would be to increase πa if At+1a=At+1 and to decrease it otherwise. It would not be so different, as there is a final renormalization step, and empirically this variation has little impact on the performance of Aggregator.

The Exp4 algorithm (see, e.g. [?]) is similar to Aggregator, presented in Algorithm ?, but differs in the two following points. First, a∼πt is sampled first and the arm chosen by Aa is trusted, whereas Aggregator needs to listen to the N decisions to perform the updates on πt+1, and Exp4 gives back an observation (arm, reward) only to the last trusted algorithm whereas Aggregator gives it to all algorithms. Second, after having computed the loss estimate ℓ, Exp4 updates the estimated cumulative loss for each algorithm, ˜La(t)=∑ts=1ℓAsa(s)×1(Asa=Asaggr). Instead of updating πt multiplicatively as we do for our proposal, Exp4 recomputes it, proportionally to exp(−ηt˜La(t)).

The sequence of non-negative learning rates (ηt)t≥1 used by Exp4 can be arbitrary, it can be constant but should be non-increasing [?]. If the horizon T is known (and fixed), the best choice is given by ηt=η:=2log(N)/(TK). However, for real-world communication problems, it is unrealistic to assume a fixed and known time horizon, so we prefer the alternative horizon-free choice of learning rates, ηt:=log(N)/(tK) suggested by [?]. We compare both approaches empirically, and the second one usually performs better. We also stick to this choice of (ηt)t≥1 for Aggregator.

Figure 1: On a “simple” Bernoulli problem (semilog-y scale).Figure 2: On a “harder” Bernoulli problem, they all have similar performances, except LearnExp.Figure 3: On an “easy” Gaussian problem, only Aggregator shows reasonable performances, thanks to BayesUCB and Thompson sampling.

We focus on i.i.d. MAB problems, with K=9 channels2. For Bernoulli problem, the first one uses μ=[0.1,…,0.9], and the second one is divided in three groups: 2 very bad arms (μ=0.01,0.02), 5 average arms (μ=0.3 to 0.6) and 3 very good arms (μ=0.78,0.8,0.82). The horizon is set to T=20000 (but its value is unknown to all algorithms), and simulations are repeated 1000 times, to estimate the expected regret. This empirical estimation of the expected regret RT is plotted below, as a function of T, comparing some algorithms A1,…,AN (for N=6), and their aggregation with Aggregator (displayed in orange bold). The Lai & Robbins’ logarithmic lower-bound [?] is also plotted, and it is crucial to note that it is only asymptotic and to not be surprised by having regret curves smaller than the lower-bound (e.g., for the easier Bernoulli problem). Note that for each of the 1000 simulations, we choose to generate all the rewards beforehand, i.e., one full matrix (rk(t))1≤k≤K,1≤t≤T for every repetition, in order to compare the algorithms on the same realizations of the MAB problem.

We compare our Aggregator algorithm, as well as other aggregation algorithms, Exp4, CORRAL and LearnExp (both with default parameters) [?]. The aggregated algorithms consist in a naive uniform exploration (to have at least one algorithm with bad performances, i.e. linear regret, but it is not included in the plots), UCB with α=1/2, three kl-UCB with Bernoulli, Gaussian and exponential kl functions, and BayesUCB and Thompson sampling with uniform prior.

Figures Figure 1 and Figure 4 are in semilog-y scale, this helps to see that the best algorithms can be an order of magnitude more efficient than the worst, and the Aggregator performs similarly to the best ones, when the other aggregation algorithms are usually amongst the worst. Figure 5 is in semilog-x scale to show that the regret of efficient algorithms are indeed logarithmic.

Figure 4: On a harder problem, mixing Bernoulli, Gaussian, Exponential arms, with 3 arms of each types with the same mean.Figure 5: The semilog-x scale clearly shows the logarithmic growth of the regret for the best algorithms and our proposal Aggregator, even in a hard “mixed” problem (cf. Figure ).

For Bernoulli problems (Figures Figure 1 and Figure 2), UCB with α=1/2, Thompson sampling, BayesUCB and kl-UCB+ (with the binary kl function) all perform similarly, and Aggregator is found to be as efficient as all of them. For Gaussian and exponential arms, rewards are truncated into [0,1], and the variance of Gaussian distributions is fixed to σ2=0.05 for all arms, and can be known to the algorithms (the kl function is adapted to this one-dimensional exponential family). Figure 3 uses only Gaussian arms, with a large gap between their means and a relatively small variance, giving an “easy” problem. And Figure 4 shows a considerably harder “mixed” problem, when the distributions are no longer in the same one-dimensional exponential family and so the Lai & Robbins’ lower-bound no longer holds (even if there still exists a lower-bound).

For each of the 4 problems considered, the Aggregator algorithm with default option (broadcast loss to all players) is the best of all the aggregation algorithms, and its regret is very close to the best of the aggregated algorithms. Especially in difficult problems with mixed or unknown distributions, Aggregator showed to be more efficient that Exp4 and orders of magnitude better than the other reference aggregation algorithms LearnExp and CORRAL (see Figures Figure 4 and Figure 5).

The Aggregator does not have satisfying theoretical guarantees in terms of regret RT yet, unlike many bandit algorithms. Another notion, the adversarial regret, denoted by ¯¯¯¯¯¯¯RT, measures the difference in terms of rewards, between the aggregation algorithm Aaggr and the best aggregated algorithm Aa. This is in contrast with the (classical) regret, which measure the difference with the best fixed-arm strategy (Eq. ). Thus, even if the aggregated algorithms have logarithmic (classical) regret, having an adversarial regret scaling as √T does not permit to exhibit a logarithmic (classical) regret for the aggregation algorithm. Under some additional hypotheses, [?] proves that Exp4 satisfies a bound on adversarial regret, ¯¯¯¯¯¯¯RT≤2√TNlogk, with the good choice of the learning rate sequence (ηt)t≥1. Our proposed algorithm follows quite closely the architecture of Exp4, and a similar bound for Aggregator is expected to hold.

This would be a first theoretical guarantee, but not satisfactory as simple algorithms like UCB have regrets scaling as logT[?], not √T. Regret bounds in several different settings are proved for the CORRAL algorithm [?], but no logarithmic upper-bound can be obtained from their technique, even in the simplest setting of stochastic bandits. However, Aggregator always seems to have a (finite-horizon) logarithmic regret in all the experiments we performed, for both Bernoulli and non-Bernoulli problems (e.g., Gaussian, exponential and Poisson distributions). Further theoretical developments are left as future work.

We presented the use of aggregation algorithms in the context of Opportunistic Spectrum Access for Cognitive Radio, especially for the real-world setting of unknown problem instances, when tuning parameters before-hand is no longer possible and an adaptive algorithm is preferable. Our proposed Aggregator was presented in details, and we also highlighted its differences with Exp4.

We realized experiments on simple MAB problems already used in the community of bandit algorithms for OSA [?], and the simulations results showed that Aggregator works as expected, being able to identify on the fly the best algorithm to trust for a specific problem. Experiments on problems mixing different families of distributions were also presented, with similar conclusions in favor of Aggregator. It is not presented in this article, but our proposed algorithm also works well in dynamic scenarios, in which the distribution of the arms can change abruptly at some time, and appears to be more robust than simple non-aggregated algorithms.
Exp4 has theoretical guarantees in terms of adversarial regret, and even if the same result could hold for Aggregator, results in terms of classical regret are yet to be proved. Empirically, Aggregator showed to always have a logarithmic regret RT if it aggregates algorithms with logarithmic regrets (like UCB, kl-UCB, Thompson sampling, BayesUCB etc). It usually succeeds to be close to the best of the aggregated algorithms, both in term of regret and best arm pull frequency. As expected, the Aggregator is never able to outperform any of the aggregated algorithms, but this was an over-optimistic goal. What matters the most is that, empirically, Aggregator is able to quickly discover on the fly the best algorithms to trust, and then performs almost as well as if it was following it from the beginning.

Our Aggregator algorithm can probably be rewritten as an Online Mirror Descent, as Exp4 and CORRAL, but this does not appear useful as in the case of CORRAL the analysis cannot bring a logarithmic bound on the regret even by aggregating asymptotically optimal algorithms. We will continue investigating regret bounds for Aggregator, and other directions include possible applications to the non-stochastic case (e.g., rested or restless Markovian problems, like it was very recently studied in [?]).

Footnotes

The model is similar for other distributions, and we also experimented and tested our proposal Aggregator with Gaussian, exponential and Poisson distributions, with unbounded or finite in [0,1] support, and similar conclusions were observed. Non-discrete rewards rk(t) are interpreted as a relative communication efficiency, but we do not cover this aspect here.

Similar behaviors are observed for any not-too-large values of K, we tried up-to K=100 and the same results were obtained.

The model is similar for other distributions, and we also experimented and tested our proposal Aggregator with Gaussian, exponential and Poisson distributions, with unbounded or finite in [0,1] support, and similar conclusions were observed. Non-discrete rewards rk(t) are interpreted as a relative communication efficiency, but we do not cover this aspect here.

Similar behaviors are observed for any not-too-large values of K, we tried up-to K=100 and the same results were obtained.