How to Use Expert Advice


NICOLÒ CESA-BIANCHI, Università di Milano, Milan, Italy
YOAV FREUND, AT&T Labs, Florham Park, New Jersey
DAVID HAUSSLER AND DAVID P. HELMBOLD, University of California, Santa Cruz, Santa Cruz, California
ROBERT E. SCHAPIRE, AT&T Labs, Florham Park, New Jersey
AND MANFRED K. WARMUTH, University of California, Santa Cruz, Santa Cruz, California

Abstract. We analyze algorithms that predict a binary value by combining the predictions of several prediction strategies, called experts. Our analysis is for worst-case situations, i.e., we make no assumptions about the way the sequence of bits to be predicted is generated. We measure the performance of the algorithm by the difference between the expected number of mistakes it makes on the bit sequence and the expected number of mistakes made by the best expert on this sequence, where the expectation is taken with respect to the randomization in the predictions. We show that the minimum achievable difference is on the order of the square root of the number of mistakes of the best expert, and we give efficient algorithms that achieve this. Our upper and lower bounds have matching leading constants in most cases. We then show how this leads to certain kinds of pattern recognition/learning algorithms with performance bounds that improve on the best results currently known in this context. We also compare our analysis to the case in which log loss is used instead of the expected number of mistakes.

Categories and Subject Descriptors: I.2.1 [Artificial Intelligence]: Applications and Expert Systems; I.2.2 [Artificial Intelligence]: Automatic Programming (automatic analysis of algorithms); I.2.6 [Artificial Intelligence]: Learning (knowledge acquisition)

General Terms: Algorithms

This research was done while N. Cesa-Bianchi was visiting UC Santa Cruz, and was partially supported by the "Progetto finalizzato sistemi informatici e calcolo parallelo" of CNR under grant [...]. D. Haussler, M. K. Warmuth, and Y. Freund were supported by ONR grant N J-116 and NSF grant IRI [...].

Authors' addresses: N. Cesa-Bianchi, Università di Milano, Milan, Italy; Y. Freund and R. E. Schapire, AT&T Labs, 180 Park Ave., Florham Park, NJ [...], {yoav, [...]; D. Haussler, D. P. Helmbold, and M. K. Warmuth, University of California, Santa Cruz, Santa Cruz, CA.

Permission to make digital/hard copy of part or all of this work for personal or classroom use is granted without fee provided that the copies are not made or distributed for profit or commercial advantage, the copyright notice, the title of the publication, and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery (ACM), Inc. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission and/or a fee. ACM /97/ $03.50. Journal of the ACM, Vol. 44, No. 3, May 1997, pp. [...]

1. Introduction

A central problem in statistics and machine learning is the problem of predicting future events based on past observations. In computer science literature in particular, special attention has been given to the case in which the events are simple binary outcomes (e.g., [Haussler et al. 1994]). For example, in predicting today's weather, we may choose to consider only the possible outcomes 0 and 1, where 1 indicates that it rains today, and 0 indicates that it does not. In this paper, we show that some simple prediction algorithms are optimal for this task in a sense that is closely related to the definitions of universal forecasting, prediction, and data compression that have been explored in the information theory literature. We then give applications of these results to the theory of pattern recognition [Vapnik 1982] and PAC learning [Valiant 1984].

We take the extreme position, as advocated by Dawid and Vovk in the theory of prequential probability [Dawid 1984; 1991; 1996; Vovk 1993], Rissanen in his theory of stochastic complexity [Rissanen 1978; 1986; Rissanen and Langdon Jr. 1981; Yamanishi 1995], and Cover, Lempel and Ziv, Feder and others in the theory of universal prediction and data compression of individual sequences (footnote 1), that no assumptions whatsoever can be made about the actual sequence $\mathbf{y} = y_1, \ldots, y_\ell$ of outcomes that is observed; the analysis is done in the worst case over all possible binary outcome sequences.
Of course, no method of prediction can do better than random guessing in the worst case, so a naive worst-case analysis is fruitless. To illustrate an alternative approach in the vein of universal prediction, consider the following scenario. Let us suppose that on each morning $t$ you must predict whether or not it will rain that day (i.e., the value of $y_t$), but before making your prediction you are allowed to hear the predictions of a (fixed) finite set $\mathcal{E} = \{\mathcal{E}_1, \ldots, \mathcal{E}_N\}$ of experts. On the morning of day $t$, each expert has access to the weather outcomes $y_1, \ldots, y_{t-1}$ of the previous $t-1$ days, and possibly to the values of other weather measurements $x_1, \ldots, x_{t-1}$ made on those days, as well as today's measurements $x_t$. The measurements $x_1, \ldots, x_t$ will be called instances. Based on this data, each expert returns a real number $p$ between 0 and 1 that can be interpreted as his/her estimate of the probability that it will rain that day. After hearing the predictions of the experts, you also choose a number $p \in [0, 1]$ as your estimate of the probability of rain. Later in the day, nature sets the value of $y_t$ to either 1 or 0 by either raining or not raining. In the evening, you and the experts are scored. A person receives the loss $|p - y|$ for making prediction $p \in [0, 1]$ when the actual outcome is $y \in \{0, 1\}$. To see why this is a reasonable measure of loss, imagine that instead of returning $p \in [0, 1]$ you tossed a biased coin and predicted outcome 1 with probability $p$ and outcome 0 with probability $1 - p$. Then $|p - y|$ is the probability that your prediction is incorrect when the actual outcome is $y$.

Footnote 1. See, for example, Feder et al. [1992], Merhav and Feder [1993], Cover [1965], Cover and Shenhar [1977], Hannan [1957], Vovk [1993], and Chung [1994].

Imagine that the above prediction game is played for $\ell$ days. Let us fix the instance sequence $x_1, \ldots, x_\ell$, since it plays only a minor role here, and vary only the outcome sequence $\mathbf{y} = y_1, \ldots, y_\ell$. During the $\ell$ days, you accumulate a total loss

$$L(\mathbf{y}) = \sum_{t=1}^{\ell} |\hat{y}_t - y_t|,$$

where $\hat{y}_t \in [0, 1]$ is your prediction at time $t$. Each of the experts also accumulates a total loss based on his/her predictions. Your goal is to try to predict as well as the best expert, no matter what outcome sequence $\mathbf{y}$ is produced by nature (footnote 3). Specifically, if we let $L_{\mathcal{E}}(\mathbf{y})$ denote the minimum total loss of any expert on the particular sequence $\mathbf{y}$, then your goal is to minimize the maximum of the difference $L(\mathbf{y}) - L_{\mathcal{E}}(\mathbf{y})$ over all possible binary sequences $\mathbf{y}$ of length $\ell$.

Since most outcome sequences will look totally random to you, you still won't be able to do better than random guessing on most sequences. However, since most sequences will also look totally random to all the experts (as long as there aren't too many experts), you may still hope to do almost as well as the best expert in most cases. The difficult sequences are the ones that have some structure that is exploited by one of the experts. To do well on these sequences you must quickly zero in on the fact that one of the experts is doing well, and match his/her performance, perhaps by mimicking his/her predictions.

Through a game-theoretic analysis, we find that for any finite set of experts and any prespecified sequence length $\ell$, there is a strategy that minimizes the maximum of the difference $L(\mathbf{y}) - L_{\mathcal{E}}(\mathbf{y})$ over all possible binary outcome sequences $\mathbf{y}$ of length $\ell$. While this min/max strategy can be implemented in some cases, it is not practical in general.
However, we define an algorithm, called P for Predict, that is simple and efficient, and performs essentially as well as the min/max strategy. Actually P is a family of algorithms that is related to the algorithm studied by Vovk [1990] and the Bayesian, Gibbs, and weighted majority methods studied by a number of authors (footnote 4), as well as the method developed by Feder et al. [1992]. We show that P performs quite well in the sense defined above so that, for example, given any finite set $\mathcal{E}$ of weather forecasting experts, P is guaranteed not to perform much worse than the best expert in $\mathcal{E}$, no matter what the actual weather turns out to be. The algorithm P is completely generic in that it makes no use of the side information provided by the instances $x_1, \ldots, x_\ell$. Thus, it would also do almost as well as the Wall Street expert with the best inside information when predicting whether the stock market will rise or fall. An alternate logarithmic loss function, often considered in the literature, is discussed briefly in Section 8.

Footnote 3. This approach is also related to that taken in recent work on the competitive ratio of on-line algorithms, and in particular to work on combining on-line algorithms to obtain the best competitive ratio [Fiat et al. 1991a; 1991b; 1994], except that we look at the difference in performance rather than the ratio.

Footnote 4. See, for example, Littlestone and Warmuth [1994], Littlestone et al. [1995], Haussler et al. [1994], Sompolinsky et al. [1992], Seung et al. [1992], Haussler and Barron [1992], and Helmbold and Warmuth [1995].

In particular, letting $L_P(\mathbf{y})$ denote the total loss of algorithm P on the sequence $\mathbf{y}$ and $L_{\mathcal{E}}(\mathbf{y})$ the loss of the best expert on $\mathbf{y}$ as above, we show (Theorem 4.5.1) that for all binary (footnote 5) outcome sequences $\mathbf{y}$ of length $\ell$,

$$L_P(\mathbf{y}) - L_{\mathcal{E}}(\mathbf{y}) \le \sqrt{\frac{\ell \ln(|\mathcal{E}| + 1)}{2}} + \frac{\log_2(|\mathcal{E}| + 1)}{2},$$

and that no algorithm can improve the multiplicative constant of the square-root term for $\ell, |\mathcal{E}| \to \infty$, where $|\mathcal{E}|$ is the number of experts.

Previous work has shown how to construct an algorithm A such that the ratio $L_A(\mathbf{y}) / L_{\mathcal{E}}(\mathbf{y})$ approaches 1 in the limit [Vovk 1990; Littlestone and Warmuth 1994; Feder et al. 1992]. In fact, Vovk [1990] described an algorithm with the same bound as the one we give in Theorem 4.2.1 for the algorithm P. This theorem leaves a parameter to be tuned. Vovk gives an implicit form of the optimum choice of the parameter. We arrive at an explicit form that allows us to prove nearly optimal bounds on $L_A(\mathbf{y}) - L_{\mathcal{E}}(\mathbf{y})$. To our knowledge, our results give the first precise bounds on this difference.

It turns out that these bounds also give a tight lower bound on the expectation of the minimal $L_1$ distance between a random binary string uniformly chosen from $\{0, 1\}^\ell$ and a set of $N$ points in $[0, 1]^\ell$. This answer to a basic combinatorial question may be of independent interest.

The remainder of this paper is organized as follows: In Section 3, we characterize exactly the performance of the best possible prediction strategy using a min/max analysis. Section 4 describes the algorithm P and shows that it achieves the optimal bound given above. In Section 4.4, we show that, if the loss $L_{\mathcal{E}}(\mathbf{y})$ of the best expert is given to the algorithm a priori, then P can be tuned so that

$$L_P(\mathbf{y}) - L_{\mathcal{E}}(\mathbf{y}) \le \sqrt{L_{\mathcal{E}}(\mathbf{y}) \ln |\mathcal{E}|} + \frac{\log_2 |\mathcal{E}|}{2}.$$

In Section 4.6, we show that even when no knowledge of $L_{\mathcal{E}}(\mathbf{y})$ is available, one can use a doubling trick to obtain a bound on $L_P(\mathbf{y}) - L_{\mathcal{E}}(\mathbf{y})$ that is only a small constant factor larger than the above bound. This algorithm can nearly match the performance of the best expert on all prefixes of an infinite sequence $\mathbf{y}$.
Finally, in Section 5, we show how the results we have obtained can be applied in another machine learning context. We describe a pattern recognition problem in which examples $(x_1, y_1), \ldots, (x_{t-1}, y_{t-1})$ are drawn independently at random from some arbitrary distribution on the set of all possible labeled instances and the goal is to find a function that will predict the binary label $y_t$ of the next random example $(x_t, y_t)$ correctly. Performance is measured relative to the best binary-valued function in a given class of functions, called the comparison class. This kind of relative performance measure is called regret in statistics. General solutions to this regret formulation of the pattern recognition problem have been developed by Vapnik [1982], Birge and Massart [1993], and others. This problem can also be described as a special variant of the probably approximately correct (PAC) learning model [Valiant 1984] in which nothing is assumed about the target concept that generates the examples other than independence between examples (sometimes referred to as agnostic learning [Kearns et al. 1994]), and in which the learning algorithm is not required to return a hypothesis in any specific form. Using the prediction strategy P, we develop an algorithm that solves this pattern recognition problem and derive distribution-independent bounds for the performance of this algorithm. These bounds improve by constant factors some of the (more general) bounds obtained by Vapnik [1982] and Talagrand [1994] on the performance of an empirical loss minimization algorithm.

Footnote 5. The algorithm has recently been extended to the case when the outcomes are in the interval [0, 1] with the performance bounds as in the binary case [Haussler et al. 1995].

The results presented in this paper contribute to an ongoing program in information theory and statistics to minimize the number of assumptions placed on the actual mechanism generating the observations through the development of robust procedures and strengthened worst-case analysis. In investigating this area, we have been struck by the fact that many of the standard-style statistical results that we have found most useful, such as the bounds given by Vapnik, have worst-case counterparts that are much stronger than we had expected would be possible. We believe that if these results can be extended to more general loss functions and learning/prediction scenarios, with corresponding optimal estimation of constants and rates, this worst-case viewpoint may ultimately provide a fruitful alternative foundation for the statistical theory of learning and prediction.

2. An Overview of the Prediction Problem

In this section, we define the problem of predicting binary sequences and give an overview of our results on this problem. We refer to the binary sequence to be predicted as the outcome sequence, and we denote it by $\mathbf{y} = y_1, \ldots, y_t, \ldots, y_\ell$, where $t$ is the index of a typical time step or trial, $y_t \in \{0, 1\}$, and $\ell$ is the length of the sequence.
We denote by $\mathbf{y}^t$ the prefix of length $t$ of $\mathbf{y}$, that is, $\mathbf{y}^t = y_1, \ldots, y_t$. We denote the set of experts by $\mathcal{E} = \{\mathcal{E}_1, \ldots, \mathcal{E}_N\}$, where $N$ is the number of experts. The prediction of expert $\mathcal{E}_i$ at time $t$ is denoted by $\xi_{i,t} \in [0, 1]$ and the prediction of the algorithm at time $t$ is denoted by $\hat{y}_t \in [0, 1]$.

A prediction algorithm is an algorithm that at time $t = 1, \ldots, \ell$ receives as input a vector of expert predictions $(\xi_{1,t}, \ldots, \xi_{N,t})$, as well as the predictions made by the experts in the past (i.e., $(\xi_{1,1}, \ldots, \xi_{N,1}), \ldots, (\xi_{1,t-1}, \ldots, \xi_{N,t-1})$), the sequence of past outcomes (i.e., $\mathbf{y}^{t-1}$), and the predictions made by the algorithm in the past (i.e., $\hat{y}_1, \ldots, \hat{y}_{t-1}$). The prediction algorithm maps these inputs into its current prediction $\hat{y}_t$.

The loss of prediction algorithm A on a sequence of $\ell$ trials with respect to a sequence of outcomes $\mathbf{y}$ (and set of experts $\mathcal{E}$) is defined to be the sum $\sum_{t=1}^{\ell} |\hat{y}_t - y_t|$, which is denoted $L_A(\mathbf{y})$. Note that the set of experts $\mathcal{E}$ will always be understood from context so we can suppress the dependence of $L_A(\mathbf{y})$ on $\mathcal{E}$. Similarly, the loss of expert $\mathcal{E}_i$ with respect to $\mathbf{y}$ is defined to be $\sum_{t=1}^{\ell} |\xi_{i,t} - y_t|$ and is denoted $L_i(\mathbf{y})$. Finally, the loss of the best expert is denoted by $L_{\mathcal{E}}(\mathbf{y})$; thus, $L_{\mathcal{E}}(\mathbf{y}) = \min_{i=1,\ldots,N} L_i(\mathbf{y})$.

Our goal is to find algorithms whose loss $L_A(\mathbf{y})$ is not much larger than $L_{\mathcal{E}}(\mathbf{y})$. Moreover, our ultimate goal is to prove bounds that hold uniformly for all outcome sequences and expert predictions, and that assume little or no prior knowledge on the part of the prediction algorithm.

This problem can be viewed as a game in which the predictor plays against an adversary who generates both the experts' predictions and the outcomes. We assume that both players can observe all of the actions made by the other player up to the current point of time, as well as its own past actions. The game consists of $\ell$ time steps, and both sides know $\ell$ before the game begins. We now describe the binary sequence prediction game. At each time step $t = 1, \ldots, \ell$, the game proceeds as follows:

(1) The adversary chooses the experts' predictions, $\xi_{i,t} \in [0, 1]$, for $1 \le i \le N$.
(2) The predictor generates its prediction $\hat{y}_t \in [0, 1]$.
(3) The adversary chooses the outcome $y_t \in \{0, 1\}$.

The goal of the predictor in this game is to minimize its net loss $L_A(\mathbf{y}) - L_{\mathcal{E}}(\mathbf{y})$. The goal of the adversary is to maximize this value (footnote 6). The min/max value for this game is the worst-case net loss of the optimal prediction strategy. We will denote this min/max value by $V_{N,\ell}$.

In the following section, we give the optimal min/max strategy for the predictor and for the adversary in this game. This analysis gives a simple recursive equation for $V_{N,\ell}$. Unfortunately, we don't have a closed-form expression that solves this equation. However, using results obtained in Sections 3 and 4, we can show that

$$V_{N,\ell} = (1 + o(1)) \sqrt{\frac{\ell \ln N}{2}},$$

where $o(1) \to 0$ as $N, \ell \to \infty$.

In Section 3.1, we analyze the optimal prediction algorithm for a case in which the adversary is somewhat restricted. Using this restriction of the game we find an explicit closed-form expression that lower bounds $V_{N,\ell}$. The adversary is restricted in that the predictions of the experts are functions only of the trial number. In other words, each expert is a fixed sequence of $\ell$ numbers in [0, 1]. We call these static experts. We also assume that these sequences are known to the predictor in advance.
We derive the exact min/max solution for this restricted game for any choice of the sequences. We obtain our explicit lower bound by analyzing the case in which the $N$ expert sequences are chosen using independent coin flips.

In Section 4, we present a family of prediction algorithms for the general prediction game. The basic algorithm, which we call P, has a real-valued parameter that controls its behavior. This parameter plays a role similar to that of the learning rate parameter used in gradient-based learning algorithms [Haykin 1994]. Different choices of the parameter guarantee different performance bounds for the algorithm. The optimal choice of the parameter is of critical importance and occupies much of the discussion in the sections that follow.

Footnote 6. Formally, an expert in this context is a function of the form $\mathcal{E}_i : ([0, 1] \times \{0, 1\})^* \to [0, 1]$. The interpretation here is that $\mathcal{E}_i$ maps a finite sequence $((\hat{y}_1, y_1), \ldots, (\hat{y}_{t-1}, y_{t-1}))$ of prediction/outcome pairs to a new expert prediction $\xi_{i,t}$. (Note that each $\mathcal{E}_i$ function can compute the value of the other $\mathcal{E}_j$ functions, and thus the experts' predictions can depend on the predictions made by experts in the past, as well as the current time $t$.)
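The bookkeeping of the prediction game described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `play` and `mean_predictor` are hypothetical, and the averaging predictor is a placeholder chosen for simplicity, not the algorithm P analyzed in the paper.

```python
from typing import Callable, List, Sequence, Tuple

# A predictor sees all expert predictions up to and including the current
# trial, plus the past outcomes, and returns a number in [0, 1].
Predictor = Callable[[Sequence[Sequence[float]], Sequence[int]], float]


def mean_predictor(expert_history: Sequence[Sequence[float]],
                   past_outcomes: Sequence[int]) -> float:
    """Hypothetical placeholder: average the experts' current predictions."""
    current = expert_history[-1]
    return sum(current) / len(current)


def play(expert_preds: List[List[float]], outcomes: List[int],
         predictor: Predictor) -> Tuple[float, float, float]:
    """Return (L_A, L_best, L_A - L_best) under the absolute loss |p - y|.

    expert_preds[t][i] is the prediction of expert i at trial t (0-based).
    """
    n_experts = len(expert_preds[0])
    alg_loss = 0.0
    expert_loss = [0.0] * n_experts
    for t, y in enumerate(outcomes):
        xi = expert_preds[t]
        # The predictor commits before the outcome y is revealed.
        y_hat = predictor(expert_preds[: t + 1], outcomes[:t])
        alg_loss += abs(y_hat - y)
        for i in range(n_experts):
            expert_loss[i] += abs(xi[i] - y)
    best = min(expert_loss)
    return alg_loss, best, alg_loss - best
```

For instance, with two experts that always predict 0 and 1 respectively and outcomes 1, 1, 0, the averaging predictor loses 1.5 in total while the best expert loses 1, so the net loss is 0.5.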

We analyze three variants of the algorithm, each of which chooses the parameter in a different way, according to the type of knowledge available to the predictor. The first variant chooses the parameter when the predictor knows only an upper bound on the loss of the best expert. The second variant chooses the parameter in a situation where the predictor knows only the length $\ell$ of the game. The third variant handles the case where the predictor knows nothing at all in advance. Using the analysis of the second case, we get an upper bound for $V_{N,\ell}$ that asymptotically matches the lower bound from Section 3.2.

3. An Optimal Prediction Strategy

We now give the optimal prediction algorithm for the binary sequence prediction problem. This algorithm is based on the optimal min/max solution of the binary sequence prediction game described in the previous section, guaranteeing that it has the best possible worst-case performance. However, the algorithm is computationally expensive.

The following function plays a major role in the construction and analysis of the optimal prediction strategy. Let $\mathbb{R}^+$ denote the nonnegative reals, and $\mathbb{N}$ denote the nonnegative integers. We define the function $v : (\mathbb{R}^+)^N \times \mathbb{N} \to \mathbb{R}^+$ inductively as follows:

$$v(M, 0) = \min_{1 \le i \le N} M_i \qquad (1)$$

$$v(M, r) = \min_{Z \in [0,1]^N} \frac{v(M + Z, r - 1) + v(M + \mathbf{1} - Z, r - 1)}{2} \qquad (2)$$

where the $\mathbf{1}$ in the expression $M + \mathbf{1} - Z$ denotes the vector of $N$ 1's, and $M_i$ is the $i$th component of vector $M$. Clearly, this function is well defined and can, in principle, be calculated for any given $M$ and $r$. We will discuss the complexity of this computation after the proof of Theorem 3.2.

The parameters of the function $v$ are interpreted as follows: The integer $r$ denotes the number of remaining trials, that is, the number of sequence bits that remain to be predicted. The past loss incurred by the expert $\mathcal{E}_i$ when there are $r$ remaining trials will be denoted $M_i^r$, and $M^r$ will denote the vector $(M_1^r, \ldots, M_N^r)$. It is the quantity $v(M^r, r)$ that will be important in our analysis.
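As a rough illustration of Eqs. (1) and (2), the following Python sketch evaluates $v(M, r)$ for integer loss vectors. It rests on one assumption that is only justified later in this section (via the concavity of $v$): that the minimum over $Z \in [0,1]^N$ is attained at a corner of the cube, so it suffices to search $Z \in \{0,1\}^N$. The function names are hypothetical, and the running time is exponential in both $N$ and $r$, so this is usable only for very small cases.

```python
from functools import lru_cache
from itertools import product
from typing import Tuple


@lru_cache(maxsize=None)
def v(M: Tuple[int, ...], r: int) -> float:
    """Evaluate v(M, r), searching only the corners Z in {0,1}^N (assumption)."""
    if r == 0:
        # Eq. (1): loss of the best expert once no trials remain.
        return min(M)
    best = float("inf")
    for Z in product((0, 1), repeat=len(M)):
        m_if_0 = tuple(m + z for m, z in zip(M, Z))      # outcome 0: M + Z
        m_if_1 = tuple(m + 1 - z for m, z in zip(M, Z))  # outcome 1: M + 1 - Z
        # Eq. (2): average over the two outcomes, minimized over Z.
        best = min(best, (v(m_if_0, r - 1) + v(m_if_1, r - 1)) / 2)
    return best


def minmax_value(n_experts: int, length: int) -> float:
    """Hypothetical helper: length/2 - v(0, length)."""
    return length / 2 - v((0,) * n_experts, length)
```

With a single expert every branch contributes the same amount, so `v((0,), r)` equals `r / 2` and `minmax_value(1, r)` is 0; with two experts and two trials the helper returns 0.5.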
In some sense, $v(M^r, r)$ is measuring the anticipated loss of the best expert on the entire sequence of $\ell$ trials.

In order to show that our prediction strategy generates predictions that are in the range [0, 1], we will need the following lemma, which shows that the function $v(M, r)$ obeys a Lipschitz condition:

LEMMA 3.1. For any $r \in \mathbb{N}$ and any $X, Y \in (\mathbb{R}^+)^N$,

$$|v(X, r) - v(Y, r)| \le \|X - Y\|_\infty,$$

where $\|X - Y\|_\infty = \max_i |X_i - Y_i|$.

PROOF. The proof is by induction on $r$: If $r = 0$, let $i_0$ be an index that minimizes $\{X_i\}$ and $j_0$ be an index that minimizes $\{Y_i\}$. Then

Algorithm MM

1. Initialize:
   - $t := 1$ {current trial number}
   - $r := \ell$ {number of remaining trials}
   - $M^r := \mathbf{0}$ {current cumulative loss vector}
2. While $t \le \ell$, repeat:
   - Receive the predictions of the $N$ experts, $Z^r = (\xi_{1,t}, \ldots, \xi_{N,t})$.
   - Compute and output prediction
     $$\hat{y}_t = \frac{v(M^r + Z^r, r - 1) - v(M^r + \mathbf{1} - Z^r, r - 1) + 1}{2} \qquad (3)$$
     where $v$ is defined by Eqs. (1) and (2).
   - Receive the correct outcome $y_t$.
   - $M_i^{r-1} := M_i^r + |y_t - \xi_{i,t}|$ for $i = 1, \ldots, N$.
   - $t := t + 1$
   - $r := r - 1$

FIG. 1. Description of algorithm MM.

The following theorem, the main result of this section, characterizes the loss of this strategy exactly in terms of the function $v$, and shows moreover that this strategy is the best possible.

THEOREM 3.2. Let MM be the prediction strategy described above. Then for any set of experts $\mathcal{E}$ and for any outcome sequence $\mathbf{y}$, the loss of MM is bounded by

$$L_{MM}(\mathbf{y}) - L_{\mathcal{E}}(\mathbf{y}) \le \frac{\ell}{2} - v(\mathbf{0}, \ell),$$

where $\ell$ is the number of prediction trials, $N$ is the number of experts, and $\mathbf{0}$ is a vector of $N$ zeros. Moreover, MM is optimal in the sense that, for every prediction strategy A, there exists a set of experts $\mathcal{E}$ and an outcome sequence $\mathbf{y}$ for which

$$L_A(\mathbf{y}) - L_{\mathcal{E}}(\mathbf{y}) \ge \frac{\ell}{2} - v(\mathbf{0}, \ell).$$

Hence $V_{N,\ell} = \ell/2 - v(\mathbf{0}, \ell)$.

PROOF. The first part of the theorem is proved using induction on the number $r$ of remaining trials. As above, let $M^r$ be an $N$-dimensional vector that describes the losses of each of the $N$ experts on the first $\ell - r$ trials (so $r$ trials remain) and let $\lambda^r$ denote the loss incurred by MM on these first $\ell - r$ trials. Then our inductive hypothesis is a bound on the net loss of MM at the end of the game, namely,

$$L_{MM}(\mathbf{y}) - L_{\mathcal{E}}(\mathbf{y}) \le \lambda^r + \frac{r}{2} - v(M^r, r). \qquad (4)$$

It is clear that if we choose $r = \ell$ we get the statement of the theorem, since $M^\ell = \mathbf{0}$ and $\lambda^\ell = 0$. We now present the inductive proof of the claim. For $r = 0$, the claim follows directly from the definitions since $v(M^0, 0)$ is equal to the loss of the best expert at the end of the game, $r/2 = 0$, and $\lambda^0$ is the loss of MM. For $r > 0$, let $Z^r = (\xi_{1,t}, \ldots, \xi_{N,t})$ denote the predictions given by the experts at trial $t = \ell - r + 1$ (i.e., when there are $r$ future outcomes to predict). Using the inductive assumption for $r - 1$ and Eq. (3) we can calculate the loss of MM at the end of the game; for the two possible values of the next outcome $y_t$ we get that the net loss is bounded by the same quantity, which agrees with the claim for $r$ remaining trials.

If $y_t = 0$, then the loss of MM up to the next step is $\lambda^{r-1} = \lambda^r + \hat{y}_t$, and the loss of the experts is $M^{r-1} = M^r + Z^r$. Using the inductive assumption we get that the net loss at the end of the game will be at most

$$\lambda^{r-1} + \frac{r-1}{2} - v(M^{r-1}, r-1) = \lambda^r + \frac{v(M^r + Z^r, r-1) - v(M^r + \mathbf{1} - Z^r, r-1) + 1}{2} + \frac{r-1}{2} - v(M^r + Z^r, r-1)$$

$$= \lambda^r + \frac{r}{2} - \frac{v(M^r + Z^r, r-1) + v(M^r + \mathbf{1} - Z^r, r-1)}{2}.$$

Similarly, if $y_t = 1$, then the loss of MM at the next step is $\lambda^{r-1} = \lambda^r + 1 - \hat{y}_t$, and the loss of the experts is $M^{r-1} = M^r + \mathbf{1} - Z^r$, and we get that the net loss at the end of the game will be at most

$$\lambda^{r-1} + \frac{r-1}{2} - v(M^{r-1}, r-1) = \lambda^r + 1 - \frac{v(M^r + Z^r, r-1) - v(M^r + \mathbf{1} - Z^r, r-1) + 1}{2} + \frac{r-1}{2} - v(M^r + \mathbf{1} - Z^r, r-1)$$

$$= \lambda^r + \frac{r}{2} - \frac{v(M^r + Z^r, r-1) + v(M^r + \mathbf{1} - Z^r, r-1)}{2}.$$

Thus, for either value of $y_t \in \{0, 1\}$, we have that

$$L_{MM}(\mathbf{y}) - L_{\mathcal{E}}(\mathbf{y}) \le \lambda^r + \frac{r}{2} - \frac{v(M^r + Z^r, r-1) + v(M^r + \mathbf{1} - Z^r, r-1)}{2}$$

$$\le \max_{Z \in [0,1]^N} \left\{ \lambda^r + \frac{r}{2} - \frac{v(M^r + Z, r-1) + v(M^r + \mathbf{1} - Z, r-1)}{2} \right\}$$

$$= \lambda^r + \frac{r}{2} - \min_{Z \in [0,1]^N} \frac{v(M^r + Z, r-1) + v(M^r + \mathbf{1} - Z, r-1)}{2}$$

$$= \lambda^r + \frac{r}{2} - v(M^r, r). \qquad (5)$$

This completes the induction, and the proof of the first part of the theorem.

The proof of the lower bound proceeds similarly. Let A be any prediction strategy, let $r$ be the number of trials remaining, let $M^r$ be the vector describing the loss of each expert up to the current trial when $r$ trials remain, and let $\lambda^r$ be the loss incurred by A up to this current trial. The natural adversarial choice for the experts' predictions on the current trial $t$ is any vector $Z^r = (\xi_{1,t}, \ldots, \xi_{N,t})$ which minimizes the right-hand side of Eq. (2) (the definition of $v(M^r, r)$). If $\hat{y}_t$ is A's prediction, then the adversary chooses the outcome $y_t$ that maximizes A's loss on the trial, $|\hat{y}_t - y_t|$. We prove by induction on $r$ that this adversary forces the net loss of any algorithm to be at least

$$L_A(\mathbf{y}) - L_{\mathcal{E}}(\mathbf{y}) \ge \lambda^r + \frac{r}{2} - v(M^r, r).$$

As above, equality holds when $r = 0$. For the inductive step, let $t$ be the trial number when $r$ trials remain. Recall that $\lambda^{r-1}$ is either $\lambda^r + \hat{y}_t$ or $\lambda^r + 1 - \hat{y}_t$ and that $M^{r-1}$ is either $M^r + Z^r$ or $M^r + \mathbf{1} - Z^r$ depending on the value of $y_t$. Thus, by the inductive hypothesis and the definition of the adversary,

$$L_A(\mathbf{y}) - L_{\mathcal{E}}(\mathbf{y}) \ge \max\left\{ \lambda^r + \hat{y}_t + \frac{r-1}{2} - v(M^r + Z^r, r-1),\ \lambda^r + 1 - \hat{y}_t + \frac{r-1}{2} - v(M^r + \mathbf{1} - Z^r, r-1) \right\}$$

$$\ge \frac{1}{2}\left[ \lambda^r + \hat{y}_t + \frac{r-1}{2} - v(M^r + Z^r, r-1) \right] + \frac{1}{2}\left[ \lambda^r + 1 - \hat{y}_t + \frac{r-1}{2} - v(M^r + \mathbf{1} - Z^r, r-1) \right]$$

$$= \lambda^r + \frac{r}{2} - \frac{v(M^r + Z^r, r-1) + v(M^r + \mathbf{1} - Z^r, r-1)}{2}$$

$$= \lambda^r + \frac{r}{2} - v(M^r, r).$$

This completes the induction. Choosing $r = \ell$ gives the stated lower bound. □

We have thus proven that the prediction strategy MM, described above, achieves the optimal bounds on the net loss of any prediction strategy. However, in order to use this strategy as a prediction algorithm we need to describe how to calculate the values $v(M, r)$. At first, this calculation might seem forbiddingly complex, as it involves minimizing a recursively defined function over all choices of $Z$ in the continuous domain $[0, 1]^N$. Fortunately, as we now show, the minimal value is always achieved at one of the corner points of the cube, $Z \in \{0, 1\}^N$, so that the minimization search space is finite, albeit exponential. We prove this claim using the following lemma:

LEMMA 3.3. For any fixed $r \ge 0$, the function $v(M, r)$ is concave, that is, for any $0 \le \alpha \le 1$, and for any $X, Y \in (\mathbb{R}^+)^N$:

$$v(\alpha X + (1 - \alpha) Y, r) \ge \alpha\, v(X, r) + (1 - \alpha)\, v(Y, r).$$

PROOF. As usual, we prove the lemma by induction on $r$. For $r = 0$, suppose $i_0$ is the index that minimizes

$$v(\alpha X + (1 - \alpha) Y, 0) = \min_{1 \le i \le N} \left( \alpha x_i + (1 - \alpha) y_i \right).$$

Then the convex combination of $v(X, 0)$ and $v(Y, 0)$ can be bounded as follows:

$$\alpha \min_{1 \le i \le N} x_i + (1 - \alpha) \min_{1 \le i \le N} y_i \le \alpha x_{i_0} + (1 - \alpha) y_{i_0} = v(\alpha X + (1 - \alpha) Y, 0).$$

For $r > 0$, let $Z_0 \in [0, 1]^N$ be a choice of the argument that minimizes

$$v(\alpha X + (1 - \alpha) Y, r) = \min_{Z \in [0,1]^N} \frac{v(\alpha X + (1 - \alpha) Y + Z, r-1) + v(\alpha X + (1 - \alpha) Y + \mathbf{1} - Z, r-1)}{2}.$$

Then we get

$$v(\alpha X + (1 - \alpha) Y, r) = \frac{v(\alpha X + (1 - \alpha) Y + Z_0, r-1) + v(\alpha X + (1 - \alpha) Y + \mathbf{1} - Z_0, r-1)}{2}$$

$$= \frac{v(\alpha (X + Z_0) + (1 - \alpha)(Y + Z_0), r-1) + v(\alpha (X + \mathbf{1} - Z_0) + (1 - \alpha)(Y + \mathbf{1} - Z_0), r-1)}{2}.$$

Using the induction assumption we can bound each of the two terms and get that

$$v(\alpha X + (1 - \alpha) Y, r) \ge \frac{\alpha\, v(X + Z_0, r-1) + (1 - \alpha)\, v(Y + Z_0, r-1) + \alpha\, v(X + \mathbf{1} - Z_0, r-1) + (1 - \alpha)\, v(Y + \mathbf{1} - Z_0, r-1)}{2}$$

$$= \alpha\, \frac{v(X + Z_0, r-1) + v(X + \mathbf{1} - Z_0, r-1)}{2} + (1 - \alpha)\, \frac{v(Y + Z_0, r-1) + v(Y + \mathbf{1} - Z_0, r-1)}{2}$$

$$\ge \alpha \min_{Z \in [0,1]^N} \frac{v(X + Z, r-1) + v(X + \mathbf{1} - Z, r-1)}{2} + (1 - \alpha) \min_{Z \in [0,1]^N} \frac{v(Y + Z, r-1) + v(Y + \mathbf{1} - Z, r-1)}{2}$$

$$= \alpha\, v(X, r) + (1 - \alpha)\, v(Y, r). \qquad \Box$$

If we fix $M$ and view the function

$$\frac{v(M + Z, r-1) + v(M + \mathbf{1} - Z, r-1)}{2}$$

as a function of $Z$, we see that it is simply a positive constant times the sum of two concave functions and thus it also is concave. Therefore, the minimal value of this function over the closed cube $Z \in [0, 1]^N$ is achieved in one of the corners of the cube. This means that the function $v(M, r)$ can be computed recursively by minimizing over the $2^N$ (Boolean) choices of the experts' predictions. Each of these choices involves two recursive calls and the recursion has to be done to depth $r$. Therefore, a total of $2^{r(N+1)}$ recursive calls are made, requiring time $O(N 2^{r(N+1)})$. Dynamic programming leads to a better algorithm for calculating $v(M, r)$. However, it is still exponential in $N$. An interesting question is whether $v(M, r)$ can be computed efficiently.

To summarize this section, we have described an optimal prediction algorithm and given a recursive formula which defines its worst-case loss, and thereby obtained a recursive formula for $V_{N,\ell}$. We do not have a closed-form equation for $V_{N,\ell}$. However, we can always calculate it exactly in finite time (see Figure 5 for the values of $V_{N,\ell}$ for some small ranges of $N$ and $\ell$). Moreover, the following section provides a simple adversarial strategy that generates a lower bound on the optimal net loss $V_{N,\ell}$ and Section 4 provides a simple prediction algorithm that generates an upper bound on $V_{N,\ell}$. As we will see, these two bounds are quite tight.

3.1. PREDICTION USING STATIC EXPERTS. The strategy described above can be refined to handle certain special cases. As an example of this technique, we show in this section how to handle the case that all the experts are static in the sense that their predictions do not depend either on the observed outcomes or on the learner's predictions (footnote 7). That is, each expert can be viewed formally as a function $\mathcal{E}_i : \{1, \ldots, \ell\} \to [0, 1]$ with the interpretation that the prediction at time $t$ is $\xi_{i,t} = \mathcal{E}_i(t)$. We assume further that the learner knows this function and thus can compute the future predictions of all the experts. Thus, the adversary must choose the static experts at the beginning of the game and reveal this choice to the learning algorithm. The adversary still chooses each outcome $y_t$ on-line as before.

The resulting game is called the binary sequence prediction game with static experts and its min/max value is denoted $V^{(static)}_{N,\ell}$. Since this game is easier for the minimizing player (the predictor) than the general game, it is clear that $V^{(static)}_{N,\ell} \le V_{N,\ell}$. When $N = 2$, the values of the two games are the same for all $\ell$. However, a calculation shows that $V^{(static)}_{3,4} < V_{3,4}$ with strict inequality, so the general sequence prediction game is actually harder in the worst case than the same game with static experts. The actual values are

$$V^{(static)}_{3,4} = 1 \quad \text{and} \quad V_{3,4} = \frac{17}{16}.$$

We give below a characterization of the optimal prediction and adversarial strategies for the binary sequence prediction game with static experts. In fact we go further and analyze the game explicitly for every possible choice of the static experts. The resulting min/max values have a simple geometric interpretation.

For real vectors $\mathbf{x}$ and $\mathbf{y}$ of length $\ell$, let $\|\mathbf{x} - \mathbf{y}\|_1 = \sum_{t=1}^{\ell} |x_t - y_t|$. Let $\mathcal{E} = \{\mathcal{E}_1, \ldots, \mathcal{E}_N\}$ be a set of $N$ static experts. For any expert $\mathcal{E}_i$, its loss on the bit sequence $\mathbf{y}$ is $\sum_{t=1}^{\ell} |\mathcal{E}_i(t) - y_t| = \|\mathcal{E}_i - \mathbf{y}\|_1$, viewing $\mathcal{E}_i$ as a vector in $[0, 1]^\ell$. Thus $L_{\mathcal{E}}(\mathbf{y}) = \min_i \|\mathcal{E}_i - \mathbf{y}\|_1$.
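We define the average covering radius of $\mathcal{E}$, denoted $R(\mathcal{E})$, as the average $\ell_1$ distance from a bit sequence $\mathbf{y}$ to the nearest expert in $\mathcal{E}$, that is

$$R(\mathcal{E}) = E_{\mathbf{y}}\, L_{\mathcal{E}}(\mathbf{y}) = E_{\mathbf{y}} \min_i \|\mathcal{E}_i - \mathbf{y}\|_1,$$

where $E_{\mathbf{y}}$ denotes expectation over a uniformly random choice of $\mathbf{y} \in \{0, 1\}^\ell$. We will use the following convexity result, an analog of Lemma 3.3.

LEMMA 3.1.1. Let $\mathcal{E} = \{\mathcal{E}_i\}$ and $\mathcal{E}' = \{\mathcal{E}'_i\}$ be two sets of $N$ vectors in $[0, 1]^\ell$ and let $0 \le \alpha \le 1$. Then

$$R(\alpha \mathcal{E} + (1 - \alpha) \mathcal{E}') \ge \alpha R(\mathcal{E}) + (1 - \alpha) R(\mathcal{E}'),$$

where $\alpha \mathcal{E} + (1 - \alpha) \mathcal{E}'$ is the set of $N$ vectors $\{\alpha \mathcal{E}_i + (1 - \alpha) \mathcal{E}'_i\}$.

PROOF. Writing $\xi_{i,t}$ and $\xi'_{i,t}$ for the components of $\mathcal{E}_i$ and $\mathcal{E}'_i$,

$$R(\alpha \mathcal{E} + (1 - \alpha) \mathcal{E}') = E_{\mathbf{y}} \min_i \sum_t \left| \alpha \xi_{i,t} + (1 - \alpha) \xi'_{i,t} - y_t \right|$$

$$= E_{\mathbf{y}} \min_i \sum_t \left( \alpha |\xi_{i,t} - y_t| + (1 - \alpha) |\xi'_{i,t} - y_t| \right)$$

$$= E_{\mathbf{y}} \min_i \left( \alpha \|\mathcal{E}_i - \mathbf{y}\|_1 + (1 - \alpha) \|\mathcal{E}'_i - \mathbf{y}\|_1 \right)$$

$$\ge E_{\mathbf{y}} \left[ \alpha \min_i \|\mathcal{E}_i - \mathbf{y}\|_1 + (1 - \alpha) \min_i \|\mathcal{E}'_i - \mathbf{y}\|_1 \right]$$

$$= \alpha R(\mathcal{E}) + (1 - \alpha) R(\mathcal{E}'),$$

where the second equality follows from a case analysis of $y_t = 0$ and $y_t = 1$, combined with the fact that $\xi_{i,t}, \xi'_{i,t} \in [0, 1]$. □

Footnote 7. In an earlier version of this paper [Cesa-Bianchi et al. 1993], we incorrectly claimed that the same analysis also applied to all simulatable experts, that is, experts whose predictions can be calculated as a function only of the preceding outcomes.

THEOREM 3.1.2. Let $\mathcal{E}$ be a set of static experts whose current and future predictions are accessible to the prediction algorithm. Then there exists a prediction strategy MS such that for every sequence $\mathbf{y}$, we have

$$L_{MS}(\mathbf{y}) - L_{\mathcal{E}}(\mathbf{y}) \le \frac{\ell}{2} - R(\mathcal{E}).$$

Moreover, MS is optimal in the sense that for every prediction strategy A, there exists a sequence $\mathbf{y}$ such that

$$L_A(\mathbf{y}) - L_{\mathcal{E}}(\mathbf{y}) \ge \frac{\ell}{2} - R(\mathcal{E}).$$

Hence

$$V^{(static)}_{N,\ell} = \frac{\ell}{2} - \min_{\mathcal{E}} R(\mathcal{E}),$$

where the minimum is over all sets $\mathcal{E}$ of $N$ vectors in $\{0, 1\}^\ell$.

PROOF. For any prediction strategy A, the expected value of $L_A - L_{\mathcal{E}}$ with respect to a uniformly random choice of $\mathbf{y} \in \{0, 1\}^\ell$ is simply $\ell/2 - R(\mathcal{E})$ since we expect any algorithm to have loss $\ell/2$ on an entirely random sequence, and $R(\mathcal{E})$ is the expected loss of the best expert in $\mathcal{E}$. Thus, there must be some sequence $\mathbf{y}$ for which $L_A(\mathbf{y}) - L_{\mathcal{E}}(\mathbf{y})$ is at least as great as this expectation; this proves the second part of the theorem.

The first part of the theorem can be proved using the technique in Section 3 with only minor modifications, which we sketch briefly. First, the function $v$ is redefined to take account of the fact that the experts' predictions are prespecified. As the predictions of the experts correspond to vectors in $[0, 1]^\ell$, we can think about them as rows in an $N \times \ell$ matrix. We can calculate the average covering radius by considering one column (i.e., game iteration) at a time. That is, we define the new function $\tilde{v}$ as follows:

$$\tilde{v}(M, 0) = \min_i M_i$$

$$\tilde{v}(M, r) = \frac{\tilde{v}(M + Z^r, r - 1) + \tilde{v}(M + \mathbf{1} - Z^r, r - 1)}{2}$$

where $Z^r = (\xi_{1,t}, \ldots, \xi_{N,t})$ is the experts' predictions at trial $t = \ell - r + 1$.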
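To make the connection between the recursion $\tilde{v}$ and the average covering radius concrete, here is a small Python sketch that computes $R(\mathcal{E})$ by brute-force enumeration of all $2^\ell$ outcome sequences and, separately, evaluates $\tilde{v}(\mathbf{0}, \ell)$ column by column. The function names are hypothetical, static experts are represented as equal-length tuples of numbers in [0, 1], and both routines take time exponential in $\ell$, so this is meant only for checking small cases.

```python
from itertools import product
from typing import List, Sequence


def covering_radius(experts: Sequence[Sequence[float]]) -> float:
    """Average l1 distance from a uniform random bit string to the nearest expert."""
    length = len(experts[0])
    total = 0.0
    for y in product((0, 1), repeat=length):
        total += min(sum(abs(x - b) for x, b in zip(e, y)) for e in experts)
    return total / 2 ** length


def v_tilde(M: List[float], r: int, experts: Sequence[Sequence[float]]) -> float:
    """Recursion for static experts: no minimization, Z^r is read off the experts."""
    if r == 0:
        return min(M)
    length = len(experts[0])
    t = length - r  # 0-based index of trial (length - r + 1)
    Z = [e[t] for e in experts]
    m_if_0 = [m + z for m, z in zip(M, Z)]      # outcome 0 adds Z^r
    m_if_1 = [m + 1 - z for m, z in zip(M, Z)]  # outcome 1 adds 1 - Z^r
    return (v_tilde(m_if_0, r - 1, experts) + v_tilde(m_if_1, r - 1, experts)) / 2
```

For the two constant experts (0,0,0,0) and (1,1,1,1), both routines return 1.25. For the three experts (0,0,0,0), (1,1,1,0), (0,1,1,1), both return 1.0, so $\ell/2 - R(\mathcal{E}) = 1$ for that particular set, which is consistent with the value of $V^{(static)}_{3,4}$ quoted earlier (the sketch does not check that this set is a minimizer).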

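The (re)proof of Lemma 3.1 for $\tilde v$ is similar, except that we no longer minimize over $Z \in [0,1]^N$, and in the case that $r > 0$, $Z_0$ is replaced by $Z_r$. The new prediction strategy MS computes its prediction at time $t = \ell - r + 1$ as before with the obvious changes:

$$\hat y_t = \frac{\tilde v(\mathbf{M}^r + Z_r, r-1) - \tilde v(\mathbf{M}^r + \mathbf{1} - Z_r, r-1) + 1}{2}.$$

The induction argument given in the first part of the proof of Theorem 3.2 holds with little modification. The function $v$ is obviously replaced by $\tilde v$, and the inductive hypothesis given by Eq. (4) is modified so that equality holds for every outcome sequence:

$$L_{MS}(\mathbf{y}^r) - L_{\mathcal{E}}(\mathbf{y}) = \frac{r}{2} - \tilde v(\mathbf{M}^r, r),$$

where $L_{MS}(\mathbf{y}^r)$ is the loss of MS on the last $r$ trials and $\mathbf{M}^r$ is the vector of the experts' cumulative losses when $r$ trials remain. Also, Eq. (5) becomes the equality:

$$L_{MS}(\mathbf{y}^r) - L_{\mathcal{E}}(\mathbf{y}) = \frac{r}{2} - \frac{\tilde v(\mathbf{M}^r + Z_r, r-1) + \tilde v(\mathbf{M}^r + \mathbf{1} - Z_r, r-1)}{2} = \frac{r}{2} - \tilde v(\mathbf{M}^r, r).$$

By expanding $\tilde v(\mathbf{0}, \ell)$ according to the recursive definition, we find that

$$\tilde v(\mathbf{0}, \ell) = \frac{1}{2^\ell}\sum_{\mathbf{y}\in\{0,1\}^\ell} \tilde v\left(\sum_{r=1}^{\ell}\left[Z_r(1 - y_{\ell-r+1}) + (\mathbf{1} - Z_r)\,y_{\ell-r+1}\right],\ 0\right)$$
$$= \frac{1}{2^\ell}\sum_{\mathbf{y}\in\{0,1\}^\ell} \tilde v\left(\left(\|\xi_i - \mathbf{y}\|_1\right)_{i=1}^{N},\ 0\right)$$
$$= \frac{1}{2^\ell}\sum_{\mathbf{y}\in\{0,1\}^\ell} \min_i \|\xi_i - \mathbf{y}\|_1$$
$$= E_{\mathbf{y}}\left[\min_i \|\xi_i - \mathbf{y}\|_1\right] = R(\mathcal{E}).$$

Finally, it follows directly from the first two statements of the theorem that

$$V^{(\mathrm{static})}_{N,\ell} = \frac{\ell}{2} - \inf_{\mathcal{E}} R(\mathcal{E}),$$

where the infimum is over all sets $\mathcal{E}$ of $N$ vectors in $[0,1]^\ell$. However, in light of Lemma 3.1.1, $R(\mathcal{E})$ must be minimized by some extremal $\mathcal{E}$, that is, by $\mathcal{E} \subseteq \{0,1\}^\ell$. The last statement of the theorem follows. □

Theorem 3.1.2 tells us how to compute the worst-case performance of the best possible algorithm for any set of static experts. As an example of its usefulness, suppose that $\mathcal{E}$ consists of only two experts, one that always predicts 0, and the other always predicting 1. In this case Theorem 3.1.2 implies that the loss of the optimal algorithm MS is worse than the loss of the best expert by the following amount:

$$\frac{\ell}{2} - \frac{1}{2^\ell}\sum_{i=0}^{\ell}\binom{\ell}{i}\min\{i, \ell - i\}.$$

This result was previously proved by Cover [1965]; we obtain it as a special case.

Strategy MS makes each prediction in terms of the expected loss of the best expert on the remaining trials (where the expectation is taken over the uniformly random choice of outcomes for these trials). This is why we need the experts to be static. In general, we do not know how to efficiently compute this expectation exactly. However, the expectation can be estimated by sampling a polynomial number of randomly chosen future outcome sequences. Thus, there exists an efficient randomized variation of MS that is arbitrarily close to optimal.

3.2. AN ASYMPTOTIC LOWER BOUND ON $V_{N,\ell}$. We now use Theorem 3.1.2 to give an asymptotic lower bound on the performance of any prediction algorithm. To do this, we need to show that there are sets $\mathcal{E}$ of $N$ vectors in $\{0,1\}^\ell$ with small $R(\mathcal{E})$. We do this with a random construction, using the following lemma:

LEMMA 3.2.1. For each $\ell, N \ge 1$, let $S_{\ell,1}, \ldots, S_{\ell,N}$ be $N$ independent random variables, where $S_{\ell,i}$ is the number of heads in $\ell$ independent tosses of a fair coin. Let $A_{\ell,N} = \min_{1 \le i \le N}\{S_{\ell,i}\}$. Then

$$\liminf_{N\to\infty}\ \liminf_{\ell\to\infty}\ \frac{\ell/2 - E[A_{\ell,N}]}{\sqrt{(\ell/2)\ln N}} = 1.$$

PROOF. See Appendix A. □

From this we get

COROLLARY 3.2.2. For all $N, \ell$, let $R_{N,\ell} = \min_{\mathcal{E}} R(\mathcal{E})$, where the minimum is over all $\mathcal{E} \subseteq \{0,1\}^\ell$ of cardinality $N$. Then

$$\liminf_{N\to\infty}\ \liminf_{\ell\to\infty}\ \frac{\ell/2 - R_{N,\ell}}{\sqrt{(\ell/2)\ln N}} \ge 1.$$

PROOF. Clearly

$$\min_{\mathcal{E}} R(\mathcal{E}) \le E[R(\mathcal{E})] = E[A_{\ell,N}],$$

where the expectation is over the independent random choice of $N$ binary vectors in $\mathcal{E}$, and $A_{\ell,N}$ is as defined in Lemma 3.2.1. Hence, the result follows directly from that lemma. □

Finally, we obtain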

THEOREM 3.2.3.

$$\liminf_{N\to\infty}\ \liminf_{\ell\to\infty}\ \frac{V_{N,\ell}}{\sqrt{(\ell/2)\ln N}} \ \ge\ \liminf_{N\to\infty}\ \liminf_{\ell\to\infty}\ \frac{V^{(\mathrm{static})}_{N,\ell}}{\sqrt{(\ell/2)\ln N}} \ \ge\ 1.$$

PROOF. Follows from Corollary 3.2.2, Theorem 3.1.2, and the fact that $V_{N,\ell} \ge V^{(\mathrm{static})}_{N,\ell}$. □

Hence, for any $\varepsilon > 0$, there exist sufficiently large $N$ and $\ell$ such that

$$V_{N,\ell} \ge (1 - \varepsilon)\sqrt{(\ell/2)\ln N}.$$

4. Some Simple Prediction Algorithms

In this section, we present a parameterized prediction algorithm P for combining the predictions of a set of experts. Unlike the optimal strategy outlined in Section 3, algorithm P can be implemented efficiently. The analysis of P will give an upper bound for the min/max value $V_{N,\ell}$ that asymptotically matches the lower bound derived in the previous section.

Algorithm P($\beta$)

1. All initial weights $\{w_{1,1}, \ldots, w_{N,1}\}$ are set to 1.
2. At each time $t$, for $t = 1$ to $\ell$, the algorithm receives the predictions of the $N$ experts, $\xi_{1,t}, \ldots, \xi_{N,t}$, and computes its prediction $\hat y_t$ as follows:
   • Compute $r_t := \dfrac{\sum_{i=1}^{N} w_{i,t}\,\xi_{i,t}}{\sum_{i=1}^{N} w_{i,t}}$.
   • Output prediction $\hat y_t = F_\beta(r_t)$.
3. After the correct outcome $y_t$ is observed, the weight vector is updated in the following way.
   • For each $i = 1$ to $N$, $w_{i,t+1} = w_{i,t}\,U_\beta(|\xi_{i,t} - y_t|)$.

Definition of $F_\beta(r)$ and $U_\beta(q)$. There is some flexibility in defining the functions $F_\beta(r)$ and $U_\beta(q)$ used in the algorithm. Any functions $F_\beta(r)$ and $U_\beta(q)$ such that

$$1 + \frac{\ln((1-r)\beta + r)}{2\ln(2/(1+\beta))} \ \le\ F_\beta(r) \ \le\ \frac{-\ln(1 - r + r\beta)}{2\ln(2/(1+\beta))} \qquad (6)$$

for all $0 \le r \le 1$, and

$$\beta^{q} \ \le\ U_\beta(q) \ \le\ 1 - (1-\beta)\,q \qquad (7)$$

for all $0 \le q \le 1$, will achieve the performance bounds established below.

FIG. 2. Description of algorithm P($\beta$), with parameter $0 \le \beta < 1$.

4.1. THE ALGORITHM P. The prediction algorithm P is given in Figure 2. It works by maintaining a (nonnegative) weight for each expert. The weight of expert $i$ at time $t$ is denoted $w_{i,t}$. At each time $t$, the algorithm receives the experts' predictions, $\xi_{1,t}, \ldots, \xi_{N,t}$, and computes their weighted average, $r_t$.

Algorithm P then makes a prediction that is some function of this weighted average. Then P receives the correct value $y_t$ and slashes the weight of each expert $i$ by a multiplicative factor depending on how well that expert predicts, as measured by $|\xi_{i,t} - y_t|$. The worse the prediction of the expert, the more that expert's weight is reduced.

Algorithm P takes one parameter, a real number $\beta \in [0, 1)$, which controls how quickly the weights of poorly predicting experts drop. For small $\beta$, the algorithm quickly slashes the weights of poorly predicting experts and starts paying attention only to the better predictors. For $\beta$ closer to 1, the weights will drop slowly, and the algorithm will pay attention to a wider range of predictors for a longer time. The best value for $\beta$ depends on the circumstances. Later, we derive good choices of $\beta$ for different types of prior knowledge the algorithm may have.

There are two places where the algorithm can choose to use any real value within an allowed range. We have represented these choices by the functions $F_\beta$ and $U_\beta$, with ranges given by Eqs. (6) and (7), respectively, in Figure 2. These are called the prediction and update functions, respectively. In terms of our analysis, the exact choice for these functions is not important, as long as they lie in the allowed range. In fact, different choices could be made at different times. The following lemma shows that these ranges are nonempty.

LEMMA 4.1.1. For any $0 \le \beta < 1$ and $0 \le a \le 1$,

(1) $\displaystyle 1 + \frac{\ln((1-a)\beta + a)}{2\ln(2/(1+\beta))} \ \le\ \frac{-\ln(1 - a + a\beta)}{2\ln(2/(1+\beta))}$,

(2) $\beta^{a} \le 1 - a(1-\beta)$.

PROOF. We begin by proving part (1). The inequality can be rewritten as

$$\frac{\ln\left((\beta + a - a\beta)(1 - a + a\beta)\right)}{2\ln(2/(1+\beta))} + 1 \le 0.$$

Since $0 \le \beta < 1$, this is in turn equivalent to

$$\ln\left((\beta + a - a\beta)(1 - a + a\beta)\right) \le 2\ln\frac{1+\beta}{2}.$$

Exponentiating both sides yields

$$(\beta + a - a\beta)(1 - a + a\beta) \le \left(\frac{1+\beta}{2}\right)^{2},$$

which holds since $xy \le ((x+y)/2)^2$ for all real $x$ and $y$ (here we take $x = \beta + a - a\beta$ and $y = 1 - a + a\beta$).

To prove part (2), notice that $f(a) = \beta^{a}$ is convex since it has nonnegative second derivative for all $\beta > 0$. Thus, by definition of convex function,

$$f(\alpha x_0 + (1-\alpha)x_1) \le \alpha f(x_0) + (1-\alpha) f(x_1)$$

for all $x_0$, $x_1$ and all $0 \le \alpha \le 1$. The proof is then concluded by choosing $x_0 = 0$, $x_1 = 1$, and $\alpha = 1 - a$. □
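The description of algorithm P can be sketched in Python as follows. The paper leaves the prediction and update functions free within the ranges (6) and (7); the particular choices below are assumptions made for illustration only: the prediction function takes the midpoint of the allowed range and clips it to $[0, 1]$ (clipping stays inside the range because the upper limit is never negative and the lower limit never exceeds 1), and the update function uses the lower end of range (7), $\beta^q$. The sketch also assumes $0 < \beta < 1$, excluding $\beta = 0$ to avoid taking the logarithm of zero. All function names are hypothetical.

```python
import math


def prediction_range(r, beta):
    """Lower and upper limits of range (6) for a weighted average r."""
    c = 2.0 * math.log(2.0 / (1.0 + beta))
    lower = 1.0 + math.log((1.0 - r) * beta + r) / c
    upper = -math.log(1.0 - r + r * beta) / c
    return lower, upper


def predict(r, beta):
    """One admissible F: midpoint of range (6), clipped to [0, 1]."""
    lower, upper = prediction_range(r, beta)
    return min(1.0, max(0.0, 0.5 * (lower + upper)))


def update_factor(q, beta):
    """One admissible U: the lower end of range (7)."""
    return beta ** q


def run_p(expert_predictions, outcomes, beta):
    """Run P(beta). expert_predictions[t][i] is expert i's prediction in [0, 1]
    at trial t; outcomes[t] is 0 or 1. Returns (total loss, final weights)."""
    weights = [1.0] * len(expert_predictions[0])
    loss = 0.0
    for xi, y in zip(expert_predictions, outcomes):
        # Weighted average of the experts' predictions.
        r = sum(w * x for w, x in zip(weights, xi)) / sum(weights)
        loss += abs(predict(r, beta) - y)
        # Shrink each weight according to that expert's loss on this trial.
        weights = [w * update_factor(abs(x - y), beta) for w, x in zip(weights, xi)]
    return loss, weights


if __name__ == "__main__":
    # Two constant experts (always 0, always 1) on an all-ones sequence.
    print(run_p([(0, 1)] * 5, [1] * 5, beta=0.5))
```

With equal weights on two opposing experts the weighted average is 1/2, where the two limits of range (6) coincide, so every admissible prediction function must output 1/2 there.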

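4.2. THE PERFORMANCE OF ALGORITHM P($\beta$). Algorithm P's performance is summarized by the following theorem, which generalizes a similar result of Vovk [1990].

THEOREM 4.2.1. For any $0 \le \beta < 1$, for any set $\mathcal{E}$ of $N$ experts, and for any binary sequence $\mathbf{y}$ of length $\ell$, the loss of P($\beta$) satisfies

$$L_{P(\beta)}(\mathbf{y}) \le \frac{\ln N - L_{\mathcal{E}}(\mathbf{y})\ln\beta}{2\ln(2/(1+\beta))}.$$

The proof of the theorem is based on the following lemma.

LEMMA 4.2.2.

$$L_{P(\beta)}(\mathbf{y}) \le \frac{1}{2\ln(2/(1+\beta))}\,\ln\frac{\sum_{i=1}^{N} w_{i,1}}{\sum_{i=1}^{N} w_{i,\ell+1}}.$$

PROOF. We will show that for $1 \le t \le \ell$,

$$|\hat y_t - y_t| \le \frac{1}{2\ln(2/(1+\beta))}\,\ln\frac{\sum_{i=1}^{N} w_{i,t}}{\sum_{i=1}^{N} w_{i,t+1}}. \qquad (8)$$

The lemma then follows from summing the above inequality for $t = 1, \ldots, \ell$. We first lower bound the numerator of the right-hand side of the above inequality:

$$\ln\frac{\sum_{i=1}^{N} w_{i,t}}{\sum_{i=1}^{N} w_{i,t+1}} = -\ln\frac{\sum_{i=1}^{N} w_{i,t}\,U_\beta(|\xi_{i,t} - y_t|)}{\sum_{i=1}^{N} w_{i,t}}$$
$$\ge -\ln\frac{\sum_{i=1}^{N} w_{i,t}\left(1 - (1-\beta)|\xi_{i,t} - y_t|\right)}{\sum_{i=1}^{N} w_{i,t}}$$
$$= -\ln\left(1 - (1-\beta)|r_t - y_t|\right),$$

where the inequality follows from Eq. (7), and the last equality is verified by a case analysis using the fact that $y_t \in \{0, 1\}$. Thus, Eq. (8) is implied by

$$|\hat y_t - y_t| \le \frac{-\ln\left(1 - (1-\beta)|r_t - y_t|\right)}{2\ln(2/(1+\beta))}.$$

The above splits into two inequalities since $y_t$ is either 0 or 1. These two inequalities are the same as the two inequalities of (6) which we assumed for the prediction function. □

PROOF OF THEOREM 4.2.1. All initial weights equal 1 and thus $\sum_{i=1}^{N} w_{i,1} = N$. Let $j$ be an expert with minimum total loss on $\mathbf{y}$, that is, $\sum_{t=1}^{\ell} |\xi_{j,t} - y_t| = L_{\mathcal{E}}(\mathbf{y})$. Since, by Eq. (7), $U_\beta(q) \ge \beta^{q}$, we have that

$$\sum_{i=1}^{N} w_{i,\ell+1} \ \ge\ w_{j,\ell+1} = w_{j,1}\prod_{t=1}^{\ell} U_\beta(|\xi_{j,t} - y_t|) \ \ge\ \beta^{\sum_{t=1}^{\ell}|\xi_{j,t} - y_t|} = \beta^{L_{\mathcal{E}}(\mathbf{y})}.$$

Substituting these two facts into Lemma 4.2.2 gives the bound of the theorem. □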
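To get a feel for how the bound in the theorem above depends on the parameter, the following Python sketch evaluates it and searches a grid for the value of $\beta$ that minimizes it when the loss of the best expert is assumed known. The grid search is a numerical stand-in chosen for this example, not the tuning the paper derives, and the function names are hypothetical. The sketch assumes $0 < \beta < 1$.

```python
import math


def loss_bound(n_experts, best_loss, beta):
    """(ln N - L ln beta) / (2 ln(2 / (1 + beta))), for 0 < beta < 1."""
    numerator = math.log(n_experts) - best_loss * math.log(beta)
    return numerator / (2.0 * math.log(2.0 / (1.0 + beta)))


def best_beta_on_grid(n_experts, best_loss, steps=999):
    """Grid point in (0, 1) that minimizes the bound (illustrative only)."""
    grid = [k / (steps + 1) for k in range(1, steps + 1)]
    return min(grid, key=lambda b: loss_bound(n_experts, best_loss, b))


if __name__ == "__main__":
    for best_loss in (0, 10, 1000):
        beta = best_beta_on_grid(16, best_loss)
        print(best_loss, beta, loss_bound(16, best_loss, beta))
```

When the best expert makes no mistakes the bound is smallest for $\beta$ near 0, and as the best expert's loss grows the minimizing $\beta$ moves toward 1, which matches the qualitative discussion of the parameter in Section 4.1.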
