Notes from Week 1: Algorithms for sequential prediction


CS 683 Learning, Games, and Electronic Markets, Spring 2007
Notes from Week 1: Algorithms for sequential prediction
Instructor: Robert Kleinberg
Jan

1 Introduction

In this course we will be looking at online algorithms for learning and prediction. These algorithms are interesting in their own right as a topic in theoretical computer science, but also because of their role in the design of electronic markets (e.g. as algorithms for sequential price experimentation, or for online recommendation systems) and their role in game theory (where online learning processes have been proposed as an explanation for how players learn to play an equilibrium of a game).

2 Online algorithms formalism

For general background on online algorithms, one can look at the book Online Computation and Competitive Analysis by Borodin and El-Yaniv, or read the notes from an online algorithms course taught by Michel Goemans at MIT, available by FTP at ftp://theory.csail.mit.edu/pub/classes/18.415/notes-online.ps

In this section we give an abstract definition of online algorithms, suitable for the prediction problems we have studied in class.

Definition 1. An online computation problem is specified by:
1. A set of inputs $I = \prod_{t=1}^{\infty} I_t$.
2. A set of outputs $O = \prod_{t=1}^{\infty} O_t$.
3. A cost function $\mathrm{Cost} : I \times O \to \mathbb{R}$.
For a positive integer $T$, we will define $I[T] = \prod_{t=1}^{T} I_t$ and $O[T] = \prod_{t=1}^{T} O_t$.

One should interpret an element $i = (i_1, i_2, \ldots) \in I$ as a sequence representing the inputs revealed to the algorithm over time, with $i_t$ representing the part of the input revealed at time $t$. Similarly, one should interpret an element $o = (o_1, o_2, \ldots) \in O$ as a sequence of outputs produced by the algorithm, with $o_t$ being the output at time $t$.

Remark 1. The definition frames online computation problems in terms of an infinite sequence of inputs and outputs, but it is easy to incorporate problems with a finite time horizon $T$ as a special case of the definition. Specifically, if $|I_t| = |O_t| = 1$ for all $t > T$ then this encodes an input-output sequence in which no information comes into or out of the algorithm after time $T$.

Definition 2. An online algorithm is a sequence of functions $F_t : I[t] \to O_t$. An adaptive adversary (or, simply, adversary) is a sequence of functions $G_t : O[t-1] \to I_t$. An adversary is called oblivious if each of the functions $G_t$ is a constant function. If $F$ is an online algorithm and $G$ is an adaptive adversary, the transcript of $F$ and $G$ is the unique pair $\mathrm{Trans}(F, G) = (i, o) \in I \times O$ such that for all $t \ge 1$,
$$i_t = G_t(o_1, o_2, \ldots, o_{t-1}), \qquad o_t = F_t(i_1, i_2, \ldots, i_t).$$
The cost of $F$ and $G$ is $\mathrm{Cost}(F, G) = \mathrm{Cost}(\mathrm{Trans}(F, G))$.

One should think of the algorithm and adversary as playing a game in which the adversary specifies a component of the input based on the algorithm's past outputs, and the algorithm responds by producing a new output. The transcript specifies the entire sequence of inputs and outputs produced when the algorithm and adversary play this game.

Remark 2. Designating an oblivious adversary is equivalent to designating a single input sequence $i = (i_1, i_2, \ldots) \in I$.

Remark 3. Our definition of algorithm and adversary makes no mention of computational constraints (e.g. polynomial-time computation) for either party. In general we will want to design algorithms which are computationally efficient, but it is possible to ask meaningful and non-trivial questions about online computation without taking such constraints into account.

In defining randomized algorithms and adversaries, we think of each party as having access to infinitely many independent random bits (represented by the binary digits of a uniformly distributed element of $[0, 1]$) which are not revealed to the other party.
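The transcript construction of Definition 2 can be made concrete with a short Python sketch, truncated to a finite horizon $T$. The names `transcript`, `echo_algorithm`, `contrarian_adversary` and `alternating_oblivious` are hypothetical and chosen only for illustration; the randomized variant would, under the same assumptions, simply pass each party its private random number as an extra argument.

```python
from typing import Callable, List, Tuple

# Hypothetical type aliases: F plays the role of the family (F_t), G of (G_t).
Algorithm = Callable[[int, Tuple[int, ...]], int]   # (t, (i_1..i_t))     -> o_t
Adversary = Callable[[int, Tuple[int, ...]], int]   # (t, (o_1..o_{t-1})) -> i_t


def transcript(F: Algorithm, G: Adversary, T: int) -> Tuple[List[int], List[int]]:
    """Play the game for T rounds and return the pair (inputs, outputs)."""
    inputs: List[int] = []
    outputs: List[int] = []
    for t in range(1, T + 1):
        i_t = G(t, tuple(outputs))   # adversary sees only o_1, ..., o_{t-1}
        inputs.append(i_t)
        o_t = F(t, tuple(inputs))    # algorithm sees i_1, ..., i_t
        outputs.append(o_t)
    return inputs, outputs


def echo_algorithm(t: int, inputs: Tuple[int, ...]) -> int:
    """Toy algorithm: repeat the most recent input bit."""
    return inputs[-1]


def contrarian_adversary(t: int, outputs: Tuple[int, ...]) -> int:
    """Adaptive adversary: flip the algorithm's previous output (0 in round 1)."""
    return 0 if not outputs else 1 - outputs[-1]


def alternating_oblivious(t: int, outputs: Tuple[int, ...]) -> int:
    """Oblivious adversary: ignores the outputs, so it is one fixed input sequence."""
    return t % 2


adaptive_run = transcript(echo_algorithm, contrarian_adversary, 4)
oblivious_run = transcript(echo_algorithm, alternating_oblivious, 4)
```

Because `alternating_oblivious` never reads its second argument, it corresponds to a single fixed input sequence, which is the content of Remark 2.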

Definition 3. A randomized online algorithm is a sequence of functions $F_t : I[t] \times [0, 1] \to O_t$. A randomized adaptive adversary is a sequence of functions $G_t : O[t-1] \times [0, 1] \to I_t$. A randomized adversary is called oblivious if the output of each function $G_t(o, y)$ depends only on the parameter $y$. If $F$ and $G$ are a randomized algorithm and randomized adaptive adversary, respectively, then the transcript of $F$ and $G$ is the function $\mathrm{Trans}(F, G) : [0, 1] \times [0, 1] \to I \times O$ which maps a pair $(x, y)$ to the unique input-output pair $(i, o)$ satisfying
$$i_t = G_t(o_1, o_2, \ldots, o_{t-1}, y), \qquad o_t = F_t(i_1, i_2, \ldots, i_t, x)$$
for all $t \ge 1$. The cost of $F$ and $G$ is $\mathrm{Cost}(F, G) = E[\mathrm{Cost}(\mathrm{Trans}(F, G)(x, y))]$, when the pair $(x, y)$ is sampled from the uniform distribution on $[0, 1]^2$.

Remark 4. A randomized oblivious adversary is equivalent to a probability distribution over input sequences $i = (i_1, i_2, \ldots) \in I$.

Remark 5. In class I defined a randomized algorithm using an infinite sequence of independent random variables $(x_1, x_2, \ldots) \in [0, 1]^{\infty}$, and similarly for a randomized adversary. Consequently the transcript $\mathrm{Trans}(F, G)$ was described as a function from $[0, 1]^{\infty} \times [0, 1]^{\infty}$ to $I \times O$. This was unnecessarily complicated: a single random number $x \in [0, 1]$ contains infinitely many independent random binary digits, so it already contains as much randomness as the algorithm would need for an entire infinite sequence of input-output pairs. Accordingly, in these notes I have simplified the definition by assuming that the algorithm's and adversary's random bits are contained in a single pair of independent random real numbers $(x, y)$, with $x$ representing the algorithm's supply of random bits and $y$ representing the adversary's supply.

3 Binary prediction with one perfect expert

As a warm-up for the algorithms to be presented below, let's consider the following toy problem. The algorithm's goal is to predict the bits of an infinite binary sequence $B = (B_1, B_2, \ldots)$, whose bits are revealed one at a time. Just before the $t$-th bit is revealed, a set of $n$ experts make predictions $b_{1t}, b_{2t}, \ldots, b_{nt} \in \{0, 1\}$. The algorithm is allowed to observe all of these predictions, then it makes a guess denoted by $a_t \in \{0, 1\}$, and then the truth, $B_t$, is revealed. We are given a promise that there is at least one expert whose predictions are always accurate, i.e. we are promised that $\exists i \; \forall t \;\; b_{it} = B_t$.

This prediction problem is a special case of the framework described above. Here, $I_t = \{0, 1\} \times \{0, 1\}^n$ and $O_t = \{0, 1\}$. The input $i_t$ contains all the information revealed to the algorithm after it makes its $(t-1)$-th guess and before it makes its $t$-th guess: thus $i_t$ consists of the value of $B_{t-1}$ together with all the predictions $b_{1t}, \ldots, b_{nt}$. The output $o_t$ is simply the algorithm's guess $a_t$. The cost $\mathrm{Cost}(i, o)$ is the number of times $t$ such that $a_t \ne B_t$.

Consider the following algorithm, which we will call the "Majority algorithm": at each time $t$, it consults the predictions of all experts who did not make a mistake during one of the first $t-1$ steps. (In other words, it considers all experts $i$ such that $b_{is} = B_s$ for all $s < t$.) If more of these experts predict 1 than 0, then $a_t = 1$; otherwise $a_t = 0$.

Theorem 1. Assuming there is at least one expert $i$ such that $b_{it} = B_t$ for all $t$, the Majority algorithm makes at most $\lfloor \log_2(n) \rfloor$ mistakes.

Proof. Let $S_t$ denote the set of experts who make no mistakes before time $t$. Let $W_t = |S_t|$. If the Majority algorithm makes a mistake at time $t$, it means that at least half of the experts in $S_t$ made a mistake at that time, so $W_{t+1} \le \lfloor W_t / 2 \rfloor$. On the other hand, by assumption we have $W_t \ge 1$ for all $t$. Thus the number of mistakes made by the algorithm is bounded above by the number of iterations of the function $x \mapsto \lfloor x/2 \rfloor$ required to get from $n$ down to 1. This is $\lfloor \log_2(n) \rfloor$.

Remark 6. The bound of $\lfloor \log_2(n) \rfloor$ in Theorem 1 is information-theoretically optimal, i.e. one can prove that no deterministic algorithm makes strictly fewer than $\lfloor \log_2(n) \rfloor$ mistakes on every input.

Remark 7. Although the proof of Theorem 1 is very easy, it contains the two essential ingredients which will reappear in the analysis of the Weighted Majority and Hedge algorithms below. Namely, we define a number $W_t$ which measures the "remaining amount of credibility" of the set of experts at time $t$, and we exploit two key properties of $W_t$. First, when the algorithm makes a mistake, there is a corresponding multiplicative decrease in $W_t$. Second, the assumption that there is an expert whose predictions are close to the truth implies a lower bound on the value of $W_t$ for all $t$.

The second property says that $W_t$ can't shrink too much starting from its initial value of $n$; the first property says that if $W_t$ doesn't shrink too much then the algorithm can't make too many mistakes. Putting these two observations together results in the stated mistake bound. Each of the remaining proofs in these notes also hinges on these two observations, although the manipulations required to justify the two observations become more sophisticated as the algorithms we are analyzing become more sophisticated.
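The Majority algorithm of this section can be sketched in Python as follows. This is a minimal illustration over a finite horizon, assuming the predictions are given as one list of $n$ bits per round; the function name `run_majority` and the example data are hypothetical.

```python
from typing import List, Sequence


def run_majority(predictions: Sequence[Sequence[int]], truth: Sequence[int]) -> int:
    """Run the Majority algorithm and return the number of mistakes.

    predictions[t][i] is expert i's bit in round t; truth[t] is the revealed bit.
    """
    n = len(predictions[0])
    alive = set(range(n))          # S_t: experts with no mistake so far
    mistakes = 0
    for preds, B_t in zip(predictions, truth):
        ones = sum(1 for i in alive if preds[i] == 1)
        zeros = len(alive) - ones
        a_t = 1 if ones > zeros else 0       # ties are resolved in favour of 0
        if a_t != B_t:
            mistakes += 1
        # Discard every expert that was wrong in this round.
        alive = {i for i in alive if preds[i] == B_t}
    return mistakes


# Example with n = 4 experts; expert 0 is always right.
example_truth: List[int] = [1, 0, 1]
example_predictions: List[List[int]] = [
    [1, 0, 0, 0],   # majority says 0, truth is 1: one mistake, experts 1-3 discarded
    [0, 1, 1, 1],   # only expert 0 is consulted from here on
    [1, 0, 0, 0],
]
example_mistakes = run_majority(example_predictions, example_truth)
```

In this example the algorithm makes a single mistake, which is within the bound $\lfloor \log_2(4) \rfloor = 2$ of Theorem 1.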

Algorithm WMA($\varepsilon$)
    /* Initialization */
    $w_i \leftarrow 1$ for $i = 1, 2, \ldots, n$.
    /* Main loop */
    for $t = 1, 2, \ldots$
        /* Make prediction by taking weighted majority vote */
        if $\sum_{i : b_{it} = 0} w_i > \sum_{i : b_{it} = 1} w_i$
            output $a_t = 0$;
        else
            output $a_t = 1$.
        Observe the value of $B_t$.
        /* Update weights multiplicatively */
        $E_t \leftarrow$ {experts who predicted incorrectly}
        $w_i \leftarrow (1 - \varepsilon) \cdot w_i$ for all $i \in E_t$.
    end

Figure 1: The weighted majority algorithm

4 Deterministic binary prediction: the Weighted Majority Algorithm

We now present an algorithm for the same binary prediction problem discussed in Section 3. This new algorithm, the Weighted Majority algorithm, satisfies a provable mistake bound even when we don't assume that there is an expert who never makes a mistake. The algorithm is shown in Figure 1. It is actually a one-parameter family of algorithms WMA($\varepsilon$), each with a preconfigured parameter $\varepsilon \in (0, 1)$.

Theorem 2. Let $M$ denote the number of mistakes made by the algorithm WMA($\varepsilon$). For every integer $m$, if there exists an expert $i$ which makes at most $m$ mistakes, then
$$M < \left( \frac{2}{1 - \varepsilon} \right) m + \left( \frac{2}{\varepsilon} \right) \ln(n).$$

Proof. Let $w_{it}$ denote the value of $w_i$ at the beginning of the $t$-th iteration of the main loop, and let $W_t = \sum_{i=1}^{n} w_{it}$. The hypothesis implies that there is an expert $i$ such that $w_{iT} \ge (1 - \varepsilon)^m$ for all $T$, so
$$W_T > w_{iT} \ge (1 - \varepsilon)^m. \tag{1}$$
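For readers who want to experiment with Figure 1, here is a minimal Python sketch of WMA($\varepsilon$) over a finite horizon. The function name `wma`, the list-based input layout, and the example data are assumptions made for illustration; the tie-breaking rule (predict 1 unless the weight on 0 is strictly larger) follows the figure.

```python
import math
from typing import List, Sequence, Tuple


def wma(predictions: Sequence[Sequence[int]], truth: Sequence[int],
        eps: float) -> Tuple[int, List[float]]:
    """Run WMA(eps); return (number of mistakes, final weights)."""
    n = len(predictions[0])
    w = [1.0] * n                                   # initialization
    mistakes = 0
    for preds, B_t in zip(predictions, truth):
        weight_on_0 = sum(w[i] for i in range(n) if preds[i] == 0)
        weight_on_1 = sum(w[i] for i in range(n) if preds[i] == 1)
        a_t = 0 if weight_on_0 > weight_on_1 else 1  # weighted majority vote
        if a_t != B_t:
            mistakes += 1
        for i in range(n):                           # multiplicative update
            if preds[i] != B_t:
                w[i] *= (1.0 - eps)
    return mistakes, w


def theorem2_bound(m: int, n: int, eps: float) -> float:
    """Right-hand side of the mistake bound in Theorem 2."""
    return (2.0 / (1.0 - eps)) * m + (2.0 / eps) * math.log(n)


# Example: n = 3 experts, eps = 1/2; the best expert makes m = 1 mistake.
example_predictions = [[0, 1, 1], [1, 0, 1], [0, 0, 1]]
example_truth = [0, 1, 1]
M, final_weights = wma(example_predictions, example_truth, 0.5)
# M == 2 and final_weights == [0.5, 0.125, 0.5]
```

Here the algorithm makes $M = 2$ mistakes, well below the value $4 + 4\ln 3 \approx 8.39$ that Theorem 2 gives for $m = 1$, $n = 3$, $\varepsilon = 1/2$.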

Algorithm Hedge($\varepsilon$)
    /* Initialization */
    $w_x \leftarrow 1$ for $x \in [n]$
    /* Main loop */
    for $t = 1, 2, \ldots$
        /* Define distribution for sampling random strategy */
        for $x \in [n]$
            $p_t(x) \leftarrow w_x / \left( \sum_{y=1}^{n} w_y \right)$
        end
        Choose $x_t \in [n]$ at random according to distribution $p_t$.
        Observe cost function $c_t$.
        /* Update score for each strategy */
        for $x \in [n]$
            $w_x \leftarrow w_x \cdot (1 - \varepsilon)^{c_t(x)}$
        end
    end

Figure 2: The algorithm Hedge($\varepsilon$).

with the set $[n] = \{1, 2, \ldots, n\}$. In each time step $t$, the adversary designates a cost function $c_t$ from $[n]$ to $[0, 1]$, and the algorithm chooses an expert $x_t \in [n]$. The cost function $c_t$ is revealed to the algorithm only after it has chosen $x_t$. The algorithm's objective is to minimize the sum of the costs of the chosen experts, i.e. to minimize $\sum_{t=1}^{\infty} c_t(x_t)$.

Observe that this problem formulation fits into the formalism specified in Section 2; the input sequence $(i_1, i_2, \ldots)$ is given by $i_t = c_{t-1}$, the output sequence $(o_1, o_2, \ldots)$ is given by $o_t = x_t$, and the cost function is
$$\mathrm{Cost}(i, o) = \sum_{t=1}^{\infty} i_{t+1}(o_t) = \sum_{t=1}^{\infty} c_t(x_t).$$
Also observe that the binary prediction problem is a special case of the best expert problem, in which we define $c_t(x) = 1$ if $b_{xt} \ne B_t$, and $c_t(x) = 0$ otherwise.

Figure 2 presents a randomized online algorithm for the best expert problem. As before, it is actually a one-parameter family of algorithms Hedge($\varepsilon$) with a preconfigured parameter $\varepsilon \in (0, 1)$. Note the algorithm's similarity to WMA($\varepsilon$): it maintains a vector of weights, one for each expert, and it updates these weights multiplicatively using a straightforward generalization of the multiplicative update rule in WMA.
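The pseudocode of Figure 2 translates almost line by line into Python. The sketch below is illustrative only: it assumes an oblivious adversary whose cost functions are supplied up front as lists of numbers in $[0, 1]$, it indexes experts from 0 rather than 1, and the function name `hedge` and the explicit `rng` argument are hypothetical choices.

```python
import random
from typing import List, Sequence, Tuple


def hedge(cost_functions: Sequence[Sequence[float]], n: int, eps: float,
          rng: random.Random) -> Tuple[float, List[float]]:
    """Run Hedge(eps); return (total cost incurred, final weights).

    cost_functions[t][x] is the cost c_t(x) in [0, 1] of expert x in round t.
    """
    w = [1.0] * n                                     # initialization
    total_cost = 0.0
    for c_t in cost_functions:
        W = sum(w)
        p_t = [w_x / W for w_x in w]                  # sampling distribution
        x_t = rng.choices(range(n), weights=p_t)[0]   # choose x_t ~ p_t
        total_cost += c_t[x_t]                        # c_t is revealed after the choice
        w = [w[x] * (1.0 - eps) ** c_t[x] for x in range(n)]
    return total_cost, w


# Example: expert 0 never incurs cost, the other two always incur cost 1.
example_costs = [[0, 1, 1]] * 3
total, weights = hedge(example_costs, n=3, eps=0.5, rng=random.Random(0))
```

After three rounds the weights are $(1, 1/8, 1/8)$ regardless of the random choices, so the sampling distribution concentrates on the expert with zero cost. When every cost is 0 or 1 the update coincides with the WMA rule of multiplying the weight of each incorrect expert by $1 - \varepsilon$.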

Proof. Take the natural logarithm of both sides of (20).

Lemma 6. For all real numbers $y \in (0, 1)$,
$$\frac{1}{y} \ln\left( \frac{1}{1 - y} \right) < \frac{1}{1 - y}. \tag{22}$$

Proof. Apply (21) with $x = \frac{y}{1 - y}$, then divide both sides by $y$.

Lemma 7. For every pair of real numbers $x \in [0, 1]$, $\varepsilon \in (0, 1)$,
$$(1 - \varepsilon)^x \le 1 - \varepsilon x \tag{23}$$
with equality if and only if $x = 0$ or $x = 1$.

Proof. The function $y = (1 - \varepsilon)^x$ is strictly convex and the line $y = 1 - \varepsilon x$ intersects it at the points $(0, 1)$ and $(1, 1 - \varepsilon)$.

Lemma 8. For every random variable $X$ (taking positive values, so that the logarithms are defined), we have
$$E(\ln(X)) \le \ln(E(X)) \tag{24}$$
with equality if and only if there is a constant $c$ such that $\Pr(X = c) = 1$.

Proof. Jensen's inequality for convex functions says that if $f$ is a convex function and $X$ is a random variable, $E(f(X)) \ge f(E(X))$, and that if $f$ is strictly convex, then equality holds if and only if there is a constant $c$ such that $\Pr(X = c) = 1$. The lemma follows by applying Jensen's inequality to the strictly convex function $f(x) = -\ln(x)$.
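As a numerical sanity check of Lemmas 6 and 7 (not a substitute for the proofs), the following Python sketch evaluates the slack in inequalities (22) and (23) on a grid. The function names `lemma6_gap` and `lemma7_gap` and the grid resolution are illustrative assumptions.

```python
import math


def lemma6_gap(y: float) -> float:
    """Slack in (22): 1/(1-y) - (1/y) * ln(1/(1-y)); Lemma 6 says it is positive."""
    return 1.0 / (1.0 - y) - math.log(1.0 / (1.0 - y)) / y


def lemma7_gap(x: float, eps: float) -> float:
    """Slack in (23): (1 - eps*x) - (1-eps)**x; Lemma 7 says it is nonnegative."""
    return (1.0 - eps * x) - (1.0 - eps) ** x


# Grid of points strictly inside (0, 1) for y and eps, and in [0, 1] for x.
interior = [k / 100.0 for k in range(1, 100)]
closed = [k / 100.0 for k in range(0, 101)]

min_gap6 = min(lemma6_gap(y) for y in interior)
min_gap7 = min(lemma7_gap(x, eps) for x in closed for eps in interior)
```

On this grid `min_gap6` is strictly positive, and `min_gap7` is zero up to floating-point rounding, attained at the endpoints $x = 0$ and $x = 1$ where (23) holds with equality.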
