$A_n$, and consists of all points in at least one of $A_1, A_2, \ldots, A_n$. Similarly, the intersection of these subsets is denoted by either $\bigcap_{i=1}^{n} A_i$ or$^{9}$ $A_1 A_2 \cdots A_n$ and consists of all points in all of $A_1, A_2, \ldots, A_n$.
A sequence of events is a collection of events in one-to-one correspondence with the positive integers, i.e., $A_1, A_2, \ldots$, ad infinitum. A countable union, $\bigcup_{i=1}^{\infty} A_i$, is the set of points in one or more of $A_1, A_2, \ldots$. Similarly, a countable intersection $\bigcap_{i=1}^{\infty} A_i$ is the set of points in all of $A_1, A_2, \ldots$. Finally, the complement $A^c$ of a subset (event) $A$ is the set of points in $\Omega$ but not $A$.
1.2.1 Axioms for events
Given a sample space $\Omega$, the class of subsets of $\Omega$ that constitute the set of events satisfies the following axioms:

1. $\Omega$ is an event.
$^{8}$A class of elements satisfying these axioms is called a $\sigma$-algebra or, less commonly, a $\sigma$-field.
$^{9}$Intersection is also sometimes denoted as $A_1 \cap \cdots \cap A_n$, but is usually abbreviated as $A_1 A_2 \cdots A_n$.
1.2. THE AXIOMS OF PROBABILITY THEORY 7
2. For every sequence of events $A_1, A_2, \ldots$, the union $\bigcup_{n=1}^{\infty} A_n$ is an event.

3. For every event $A$, the complement $A^c$ is an event.
There are a number of important corollaries of these axioms. First, the empty set $\emptyset$ is an event. This follows from Axioms 1 and 3, since $\emptyset = \Omega^c$. The empty set does not correspond to our intuition about events, but the theory would be extremely awkward if it were omitted. Second, every finite union of events is an event. This follows by expressing $A_1 \cup \cdots \cup A_n$ as $\bigcup_{i=1}^{\infty} A_i$ where $A_i = \emptyset$ for all $i > n$. Third, every finite or countable intersection of events is an event. This follows from De Morgan's law,
$$\Bigl[\bigcup_{n} A_n\Bigr]^c = \bigcap_{n} A_n^c.$$
Although we will not make a big fuss about these axioms in the rest of the text, we will be careful to use only complements and countable unions and intersections in our analysis. Thus subsets that are not events will not arise.

Note that the axioms do not say that all subsets of $\Omega$ are events. In fact, there are many rather silly ways to define classes of events that obey the axioms. For example, the axioms are satisfied by choosing only the universal set $\Omega$ and the empty set $\emptyset$ to be events. We shall avoid such trivialities by assuming that for each sample point $\omega$, the singleton subset $\{\omega\}$ is an event. For finite sample spaces, this assumption, plus the axioms above, implies that all subsets are events.
For uncountably infinite sample spaces, such as the sinusoidal phase above, this assumption, plus the axioms above, still leaves considerable freedom in choosing a class of events. As an example, the class of all subsets of $\Omega$ satisfies the axioms but surprisingly does not allow the probability axioms to be satisfied in any sensible way. How to choose an appropriate class of events requires an understanding of measure theory, which would take us too far afield for our purposes. Thus we neither assume nor develop measure theory here.$^{10}$
From a pragmatic standpoint, we start with the class of events of interest, such as those
required to deﬁne the random variables needed in the problem. That class is then extended
so as to be closed under complementation and countable unions. Measure theory shows
that this extension is possible.
1.2.2 Axioms of probability
Given any sample space $\Omega$ and any class of events $\mathcal{E}$ satisfying the axioms of events, a probability rule is a function $\Pr\{\cdot\}$ mapping each $A \in \mathcal{E}$ to a (finite$^{11}$) real number in such a way that the following three probability axioms$^{12}$ hold:
$^{10}$There is no doubt that measure theory is useful in probability theory, and serious students of probability should certainly learn measure theory at some point. For application-oriented people, however, it seems advisable to acquire more insight and understanding of probability, at a graduate level, before concentrating on the abstractions and subtleties of measure theory.
$^{11}$The word finite is redundant here, since the set of real numbers, by definition, does not include $\pm\infty$. The set of real numbers with $\pm\infty$ appended is called the extended set of real numbers.
$^{12}$Sometimes finite additivity, (1.3), is added as an additional axiom. This addition is quite intuitive and avoids the technical and somewhat peculiar proofs given for (1.2) and (1.3).
8 CHAPTER 1. INTRODUCTION AND REVIEW OF PROBABILITY
1. $\Pr\{\Omega\} = 1$.

2. For every event $A$, $\Pr\{A\} \ge 0$.
3. The probability of the union of any sequence $A_1, A_2, \ldots$ of disjoint$^{13}$ events is given by
$$\Pr\Bigl\{\bigcup_{n=1}^{\infty} A_n\Bigr\} = \sum_{n=1}^{\infty} \Pr\{A_n\}, \qquad (1.1)$$
where $\sum_{n=1}^{\infty} \Pr\{A_n\}$ is shorthand for $\lim_{m\to\infty} \sum_{n=1}^{m} \Pr\{A_n\}$.

The axioms imply the following useful corollaries:
$$\Pr\{\emptyset\} = 0 \qquad (1.2)$$
$$\Pr\Bigl\{\bigcup_{n=1}^{m} A_n\Bigr\} = \sum_{n=1}^{m} \Pr\{A_n\} \qquad \text{for } A_1, \ldots, A_m \text{ disjoint} \qquad (1.3)$$
$$\Pr\{A^c\} = 1 - \Pr\{A\} \qquad \text{for all } A \qquad (1.4)$$
$$\Pr\{A\} \le \Pr\{B\} \qquad \text{for all } A \subseteq B \qquad (1.5)$$
$$\Pr\{A\} \le 1 \qquad \text{for all } A \qquad (1.6)$$
$$\sum_{n} \Pr\{A_n\} \le 1 \qquad \text{for } A_1, A_2, \ldots \text{ disjoint} \qquad (1.7)$$
$$\Pr\Bigl\{\bigcup_{n=1}^{\infty} A_n\Bigr\} = \lim_{m\to\infty} \Pr\Bigl\{\bigcup_{n=1}^{m} A_n\Bigr\} \qquad (1.8)$$
$$\Pr\Bigl\{\bigcup_{n=1}^{\infty} A_n\Bigr\} = \lim_{n\to\infty} \Pr\{A_n\} \qquad \text{for } A_1 \subseteq A_2 \subseteq \cdots \qquad (1.9)$$
$$\Pr\Bigl\{\bigcap_{n=1}^{\infty} A_n\Bigr\} = \lim_{n\to\infty} \Pr\{A_n\} \qquad \text{for } A_1 \supseteq A_2 \supseteq \cdots . \qquad (1.10)$$
To verify (1.2), consider a sequence of events $A_1, A_2, \ldots$ for which $A_n = \emptyset$ for each $n$. These events are disjoint since $\emptyset$ contains no outcomes, and thus has no outcomes in common with itself or any other event. Also, $\bigcup_n A_n = \emptyset$, so Axiom 3 asserts that $\Pr\{\emptyset\} = \sum_n \Pr\{\emptyset\}$, which can hold only if $\Pr\{\emptyset\} = 0$. To verify the finite additivity of (1.3), apply Axiom 3 to the disjoint sequence $A_1, \ldots, A_m, \emptyset, \emptyset, \ldots$. To verify (1.4), note that $\Omega = A \cup A^c$. Then apply (1.3) to the disjoint sets $A$ and $A^c$.
To verify (1.5), note that if $A \subseteq B$, then $B = A \cup (B - A)$ where $B - A$ is an alternate way to write $B \cap A^c$. We see then that $A$ and $B - A$ are disjoint, so from (1.3),
$$\Pr\{B\} = \Pr\bigl\{A \cup (B - A)\bigr\} = \Pr\{A\} + \Pr\{B - A\} \ \ge\ \Pr\{A\},$$
where we have used Axiom 2 in the last step.
$^{13}$Two sets or events $A_1$, $A_2$ are disjoint if they contain no common outcomes, i.e., if $A_1 A_2 = \emptyset$. A collection of sets or events is disjoint if all pairs are disjoint.
1.3. PROBABILITY REVIEW 9
To verify (1.6) and (1.7), first substitute $\Omega$ for $B$ in (1.5) and then substitute $\bigcup_n A_n$ for $A$.
Finally, (1.8) is established in Exercise 1.3, part (e), and (1.9) and (1.10) are simple conse-
quences of (1.8).
The axioms specify the probability of any disjoint union of events in terms of the individual event probabilities, but what about a finite or countable union of arbitrary events? Exercise 1.3 (c) shows that in this case, (1.3) can be generalized to
$$\Pr\Bigl\{\bigcup_{n=1}^{m} A_n\Bigr\} = \sum_{n=1}^{m} \Pr\{B_n\}, \qquad (1.11)$$
where $B_1 = A_1$ and for each $n > 1$, $B_n = A_n - \bigcup_{m=1}^{n-1} A_m$ is the set of points in $A_n$ but not in any of the sets $A_1, \ldots, A_{n-1}$. That is, the sets $B_n$ are disjoint. The probability of a countable union of disjoint sets is then given by (1.8). In order to use this, one must know not only the event probabilities for $A_1, A_2, \ldots$, but also the probabilities of their intersections. The union bound, which is derived in Exercise 1.3 (d), depends only on the individual event probabilities, and gives the following frequently useful upper bound on the union probability:
$$\Pr\Bigl\{\bigcup_{n} A_n\Bigr\} \ \le\ \sum_{n} \Pr\{A_n\} \qquad \text{(Union bound).} \qquad (1.12)$$
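As a sanity check, the union bound (1.12) can be verified by brute-force enumeration on a small finite sample space. The two-dice space and the particular events below are illustrative choices, not from the text:

```python
from itertools import product

# Sample space: two fair dice, 36 equiprobable outcomes.
omega = list(product(range(1, 7), repeat=2))

def pr(event):
    """Probability of an event (a predicate on outcomes) in this space."""
    return sum(1 for w in omega if event(w)) / len(omega)

# Three overlapping events: first die is 6, second die is 6, sum is at least 10.
A = [lambda w: w[0] == 6, lambda w: w[1] == 6, lambda w: w[0] + w[1] >= 10]

exact = pr(lambda w: any(a(w) for a in A))   # probability of the union, exactly
bound = sum(pr(a) for a in A)                # right side of (1.12)

assert exact <= bound                        # the union bound holds
assert abs(exact - 12/36) < 1e-12 and abs(bound - 18/36) < 1e-12
```

The bound is loose exactly because the three events overlap; for disjoint events it would hold with equality, as in (1.3).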
1.3 Probability review
1.3.1 Conditional probabilities and statistical independence
Deﬁnition 1.3.1. For any two events A and B with Pr{B} > 0, the conditional probability
of A, conditional on B, is deﬁned by
$$\Pr\{A|B\} = \Pr\{AB\}/\Pr\{B\}. \qquad (1.13)$$
One visualizes an experiment that has been partly carried out with B as the result. Then,
assuming Pr{B} > 0, Pr{A|B} can be viewed as the probability of A normalized to a
sample space restricted to event B. Within this restricted sample space, we can view B as
the sample space (i.e., as the set of outcomes that remain possible upon the occurrence of
B) and AB as an event within this sample space. For a ﬁxed event B, we can visualize
mapping each event A in the original space to event AB in the restricted space. It is easy
to see that the event axioms are still satisﬁed in this restricted space. Assigning probability
Pr{A|B} to each event AB in the restricted space, it is easy to see that the axioms of
probability are satisﬁed when B is regarded as the entire sample space. In other words,
everything we know about probability can also be applied to such a restricted probability
space.
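The restricted-sample-space view can be checked numerically. In this hypothetical two-dice example, computing $\Pr\{A|B\}$ from (1.13) agrees with treating $B$ itself as the sample space:

```python
from itertools import product
from fractions import Fraction

# Illustrative experiment: roll two fair dice.  B = "the sum is 8" is observed;
# A = "the first die shows 5".  Conditioning restricts the sample space to B.
omega = list(product(range(1, 7), repeat=2))

def pr(event):
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

A = lambda w: w[0] == 5
B = lambda w: w[0] + w[1] == 8

pr_A_given_B = pr(lambda w: A(w) and B(w)) / pr(B)       # definition (1.13)

restricted = [w for w in omega if B(w)]                  # B as the sample space
direct = Fraction(sum(1 for w in restricted if A(w)), len(restricted))

assert pr_A_given_B == direct == Fraction(1, 5)
```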
Deﬁnition 1.3.2. Two events, A and B, are statistically independent (or, more brieﬂy,
independent) if
Pr{AB} = Pr{A} Pr{B} .
For Pr{B} > 0, this is equivalent to Pr{A|B} = Pr{A}. This latter form corresponds to
our intuitive view of independence, since it says that the observation of B does not change
the probability of A. Such intuitive statements about “observation” and “occurrence” are
helpful in reasoning probabilistically, but sometimes cause confusion. For example, Bayes' law, in the form $\Pr\{A|B\}\Pr\{B\} = \Pr\{B|A\}\Pr\{A\}$, is an immediate consequence of the
deﬁnition of conditional probability in (1.13). However, if we can only interpret Pr{A|B}
when B is ‘observed’ or occurs ‘before’ A, then we cannot interpret Pr{B|A} and Pr{A|B}
together. This caused immense confusion in probabilistic arguments before the axiomatic
theory was developed.
The notion of independence is of vital importance in deﬁning, and reasoning about, proba-
bility models. We will see many examples where very complex systems become very simple,
both in terms of intuition and analysis, when appropriate quantities are modeled as sta-
tistically independent. An example will be given in the next subsection where repeated
independent experiments are used to understand arguments about relative frequencies.
Often, when the assumption of independence is unreasonable, it is reasonable to assume
conditional independence, where A and B are said to be conditionally independent given C
if $\Pr\{AB|C\} = \Pr\{A|C\}\Pr\{B|C\}$. Most of the stochastic processes to be studied here are characterized by various forms of independence or conditional independence.
For more than two events, the deﬁnition of statistical independence is a little more compli-
cated.
Definition 1.3.3. The events $A_1, \ldots, A_n$, $n > 2$, are statistically independent if for each collection $S$ of two or more of the integers 1 to $n$,
$$\Pr\Bigl\{\bigcap_{i \in S} A_i\Bigr\} = \prod_{i \in S} \Pr\{A_i\}. \qquad (1.14)$$
This includes the entire collection $\{1, \ldots, n\}$, so one necessary condition for independence is that
$$\Pr\Bigl\{\bigcap_{i=1}^{n} A_i\Bigr\} = \prod_{i=1}^{n} \Pr\{A_i\}. \qquad (1.15)$$
It might be surprising that (1.15) does not imply (1.14), but the example in Exercise 1.5
will help clarify this. This deﬁnition will become clearer (and simpler) when we see how to
view independence of events as a special case of independence of random variables.
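Definition 1.3.3 requires (1.14) to hold for every collection $S$, and the sketch below makes that check explicit. The example events (two fair bits, with the third event "the bits agree") are a classic illustration, not the example of Exercise 1.5: here every pair satisfies (1.14) while the full collection does not, which shows that checking one collection of indices is never enough.

```python
from itertools import combinations, product

# Sample space: two fair bits, 4 equiprobable outcomes.
omega = list(product([0, 1], repeat=2))
pr = lambda ev: sum(1 for w in omega if ev(w)) / len(omega)

# A1 = {first bit = 1}, A2 = {second bit = 1}, A3 = {bits agree}.
events = [lambda w: w[0] == 1, lambda w: w[1] == 1, lambda w: w[0] == w[1]]

def satisfies_1_14(S):
    """Check (1.14) for the collection S of event indices."""
    inter = pr(lambda w: all(events[i](w) for i in S))
    prod_ = 1.0
    for i in S:
        prod_ *= pr(events[i])
    return abs(inter - prod_) < 1e-12

pairs_ok = all(satisfies_1_14(S) for S in combinations(range(3), 2))
triple_ok = satisfies_1_14((0, 1, 2))      # Pr = 1/4, product = 1/8

assert pairs_ok and not triple_ok          # pairwise independent, not independent
```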
1.3.2 Repeated idealized experiments
Much of our intuitive understanding of probability comes from the notion of repeating
the same idealized experiment many times (i.e., performing multiple trials of the same
experiment). However, the axioms of probability contain no explicit recognition of such
repetitions. The appropriate way to handle n repetitions of an idealized experiment is
through an extended experiment whose sample points are n-tuples of sample points from
the original experiment. Such an extended experiment is viewed as n trials of the original
experiment. The notion of multiple trials of a given experiment is so common that one
sometimes fails to distinguish between the original experiment and an extended experiment
with multiple trials of the original experiment.
To be more specific, given an original sample space $\Omega$, the sample space of an $n$-repetition model is the Cartesian product
$$\Omega^n = \{(\omega_1, \omega_2, \ldots, \omega_n) : \omega_i \in \Omega \text{ for each } i,\ 1 \le i \le n\}, \qquad (1.16)$$
i.e., the set of all $n$-tuples for which each of the $n$ components of the $n$-tuple is an element of the original sample space $\Omega$. Since each sample point in the $n$-repetition model is an $n$-tuple of points from the original $\Omega$, it follows that an event in the $n$-repetition model is a subset of $\Omega^n$, i.e., a collection of $n$-tuples $(\omega_1, \ldots, \omega_n)$, where each $\omega_i$ is a sample point from $\Omega$. This class of events in $\Omega^n$ should include each event of the form $\{A_1 \times A_2 \times \cdots \times A_n\}$, where $\{A_1 \times A_2 \times \cdots \times A_n\}$ denotes the collection of $n$-tuples $(\omega_1, \ldots, \omega_n)$ where $\omega_i \in A_i$ for $1 \le i \le n$. The set of events (for $n$-repetitions) must also be extended to be closed under complementation and countable unions and intersections.
The simplest and most natural way of creating a probability model for this extended sample space and class of events is through the assumption that the $n$ trials are statistically independent. More precisely, we assume that for each extended event $\{A_1 \times A_2 \times \cdots \times A_n\}$ contained in $\Omega^n$, we have
$$\Pr\{A_1 \times A_2 \times \cdots \times A_n\} = \prod_{i=1}^{n} \Pr\{A_i\}, \qquad (1.17)$$
where $\Pr\{A_i\}$ is the probability of event $A_i$ in the original model. Note that since $\Omega$ can be substituted for any $A_i$ in this formula, the subset condition of (1.14) is automatically satisfied. In fact, the Kolmogorov extension theorem asserts that for any probability model, there is an extended independent $n$-repetition model for which the events in each trial are independent of those in the other trials. In what follows, we refer to this as the probability model for $n$ independent identically distributed (IID) trials of a given experiment.

The niceties of how to create this model for $n$ IID arbitrary experiments depend on measure theory, but we simply rely on the existence of such a model and the independence of events in different repetitions. What we have done here is very important conceptually. A probability model for an experiment does not say anything directly about repeated experiments. However, questions about independent repeated experiments can be handled directly within this extended model of $n$ IID repetitions. This can also be extended to a countable number of IID trials.
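A small enumeration illustrates the extended model: assigning each $n$-tuple the product of its coordinate probabilities yields a valid probability model that satisfies (1.17). The three-outcome experiment and its probabilities below are assumptions made only for the example:

```python
from itertools import product
from fractions import Fraction

# An assumed original experiment with three outcomes.
p = {'a': Fraction(1, 2), 'b': Fraction(1, 3), 'c': Fraction(1, 6)}
n = 3

def tuple_prob(w):
    """IID assumption: an n-tuple gets the product of coordinate probabilities."""
    q = Fraction(1)
    for x in w:
        q *= p[x]
    return q

omega_n = list(product(p, repeat=n))                     # sample space Omega^n
assert sum(tuple_prob(w) for w in omega_n) == 1          # a valid model

# Extended event {A1 x A2 x A3} with A3 = Omega (so (1.14)'s subsets come free).
A = [{'a'}, {'a', 'b'}, {'a', 'b', 'c'}]
lhs = sum(tuple_prob(w) for w in omega_n if all(w[i] in A[i] for i in range(n)))
rhs = Fraction(1)
for Ai in A:
    rhs *= sum(p[x] for x in Ai)

assert lhs == rhs == Fraction(5, 12)                     # equation (1.17)
```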
1.3.3 Random variables
The outcome of a probabilistic experiment often speciﬁes a collection of numerical values
such as temperatures, voltages, numbers of arrivals or departures in various time intervals,
etc. Each such numerical value varies, depending on the particular outcome of the experiment, and thus can be viewed as a mapping from the set $\Omega$ of sample points to the set $\mathbb{R}$ of real numbers (note that $\mathbb{R}$ does not include $\pm\infty$). These mappings from sample points to
real numbers are called random variables.
Definition 1.3.4. A random variable (rv) is essentially a function $X$ from the sample space $\Omega$ of a probability model to the set of real numbers $\mathbb{R}$. Three modifications are needed to make this precise. First, $X$ might be undefined or infinite for a subset of $\Omega$ that has 0 probability.$^{14}$ Second, the mapping $X(\omega)$ must have the property that $\{\omega \in \Omega : X(\omega) \le x\}$ is an event$^{15}$ for each $x \in \mathbb{R}$. Third, every finite set of rv's $X_1, \ldots, X_n$ has the property that for each $x_1 \in \mathbb{R}, \ldots, x_n \in \mathbb{R}$, the set $\{\omega : X_1(\omega) \le x_1, \ldots, X_n(\omega) \le x_n\}$ is an event.
As with any function, there is often confusion between the function itself, which is called $X$ in the definition above, and the value $X(\omega)$ taken on for a sample point $\omega$. This is particularly prevalent with random variables (rv's) since we intuitively associate a rv with its sample value when an experiment is performed. We try to control that confusion here by using $X$, $X(\omega)$, and $x$, respectively, to refer to the rv, the sample value taken for a given sample point $\omega$, and a generic sample value.
Definition 1.3.5. The distribution function$^{16}$ $F_X(x)$ of a random variable (rv) $X$ is a function, $\mathbb{R} \to \mathbb{R}$, defined by $F_X(x) = \Pr\{\omega \in \Omega : X(\omega) \le x\}$. The argument $\omega$ is usually omitted for brevity, so $F_X(x) = \Pr\{X \le x\}$.
Note that $x$ is the argument of $F_X(x)$ and the subscript $X$ denotes the particular rv under consideration. As illustrated in Figure 1.1, the distribution function $F_X(x)$ is non-decreasing with $x$ and must satisfy $\lim_{x\to -\infty} F_X(x) = 0$ and $\lim_{x\to \infty} F_X(x) = 1$. Exercise 1.6 proves that $F_X(x)$ is continuous from the right (i.e., that for every $x \in \mathbb{R}$, $\lim_{\epsilon \downarrow 0} F_X(x+\epsilon) = F_X(x)$).
Figure 1.1: Example of a distribution function for a rv that is neither continuous nor discrete. If $F_X(x)$ has a discontinuity at some $x_o$, it means that there is a discrete probability at $x_o$ equal to the magnitude of the discontinuity. In this case $F_X(x_o)$ is given by the height of the upper point at the discontinuity.
Because of the definition of a rv, the set $\{X \le x\}$ for any rv $X$ and any real number $x$ must be an event, and thus $\Pr\{X \le x\}$ must be defined for all real $x$.
The concept of a rv is often extended to complex random variables (rv’s) and vector rv’s.
A complex random variable is a mapping from the sample space to the set of ﬁnite complex
$^{14}$For example, consider a probability model in which $\Omega$ is the closed interval [0, 1] and the probability distribution is uniform over $\Omega$. If $X(\omega) = 1/\omega$, then the sample point 0 maps to $\infty$ but $X$ is still regarded as a rv. These subsets of 0 probability are usually ignored, both by engineers and mathematicians. Thus, for example, the set $\{\omega \in \Omega : X(\omega) \le x\}$ means the set for which $X(\omega)$ is both defined and satisfies $X(\omega) \le x$.
$^{15}$These last two modifications are technical limitations connected with measure theory. They can usually
be ignored, since they are satisﬁed in all but the most bizarre conditions. However, just as it is important
to know that not all subsets in a probability space are events, one should know that not all functions from
⌦ to R are rv’s.
$^{16}$The distribution function is sometimes referred to as the cumulative distribution function.
numbers, and a vector random variable (rv) is a mapping from the sample space to the
ﬁnite vectors in some ﬁnite-dimensional vector space. Another extension is that of defective
rv’s. A defective rv X is a mapping from the sample space to the extended real numbers,
which satisfies the conditions of a rv except that the set of sample points mapped into $\pm\infty$ has positive probability.
When rv’s are referred to (without any modiﬁer such as complex, vector, or defective), the
original deﬁnition, i.e., a function from ⌦ to R, is intended.
If $X$ has only a finite or countable number of possible sample values, say $x_1, x_2, \ldots$, the probability $\Pr\{X = x_i\}$ of each sample value $x_i$ is called the probability mass function (PMF) at $x_i$ and denoted by $p_X(x_i)$; such a random variable is called discrete. The distribution function of a discrete rv is a 'staircase function,' staying constant between the possible sample values and having a jump of magnitude $p_X(x_i)$ at each sample value $x_i$. Thus the PMF and the distribution function each specify the other for discrete rv's.
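The statement that the PMF and the distribution function specify each other can be sketched directly: the distribution function is a running sum of the PMF, and the PMF is recovered from the jumps of the staircase. The sample values and masses below are arbitrary choices:

```python
# An assumed PMF on three sample values.
pmf = {1.0: 0.2, 2.5: 0.5, 4.0: 0.3}

def F(x):
    """Staircase distribution function: total PMF mass at values <= x."""
    return sum(p for v, p in pmf.items() if v <= x)

def pmf_from_F(values):
    """Recover the PMF as the jump of F at each sample value."""
    eps = 1e-9
    return {v: F(v) - F(v - eps) for v in values}

assert abs(F(3.0) - 0.7) < 1e-12          # constant between sample values
recovered = pmf_from_F(pmf)
assert all(abs(recovered[v] - pmf[v]) < 1e-6 for v in pmf)
```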
If the distribution function $F_X(x)$ of a rv $X$ has a (finite) derivative at $x$, the derivative is called the probability density (or the density) of $X$ at $x$ and denoted by $f_X(x)$; for sufficiently small $\delta$, $\delta \cdot f_X(x)$ then approximates the probability that $X$ is mapped to a value between $x$ and $x + \delta$. A rv is said to be continuous if there is a function $f_X(x)$ such that, for each $x \in \mathbb{R}$, the distribution function satisfies $F_X(x) = \int_{-\infty}^{x} f_X(y)\, dy$. If such an $f_X(x)$ exists, it is called the probability density. Essentially this means that $f_X(x)$ is the derivative of $F_X(x)$, but it is slightly more general in that it permits $f_X(x)$ to be discontinuous.
Elementary probability courses work primarily with the PMF and the density, since they are
convenient for computational exercises. We will often work with the distribution function
here. This is partly because it is always deﬁned, partly to avoid saying everything thrice, for
discrete, continuous, and other rv’s, and partly because the distribution function is often
most important in limiting arguments such as steady-state time-average arguments. For
distribution functions, density functions, and PMF’s, the subscript denoting the rv is often
omitted if the rv is clear from the context. The same convention is used for complex or
vector rv’s.
The following tables list some widely used rv’s. If the density or PMF is given only in a
limited region, it is zero outside of that region. The moment generating function, MGF, of a rv $X$ is $\mathrm{E}\bigl[e^{rX}\bigr]$ and will be discussed in Section 1.3.12.
1.3.4 Multiple random variables and conditional probabilities
Often we must deal with multiple random variables (rv’s) in a single probability experiment.
If $X_1, X_2, \ldots, X_n$ are rv's or the components of a vector rv, their joint distribution function is defined by
$$F_{X_1 \cdots X_n}(x_1, \ldots, x_n) = \Pr\{\omega \in \Omega : X_1(\omega) \le x_1, X_2(\omega) \le x_2, \ldots, X_n(\omega) \le x_n\}. \qquad (1.18)$$
This deﬁnition goes a long way toward explaining why we need the notion of a sample space
⌦ when all we want to talk about is a set of rv’s. The distribution function of a rv fully
Table 1.1: The density, mean, variance and MGF for some common continuous rv's; for example, Exponential: $f_X(x) = \lambda \exp(-\lambda x)$, $x \ge 0$, with mean $1/\lambda$.

Table 1.2: The PMF, mean, variance and MGF for some common discrete rv's; for example, Poisson: $p_N(n) = \lambda^n \exp(-\lambda)/n!$, $n \ge 0$, with mean $\lambda$, variance $\lambda$, and MGF $\exp[\lambda(e^r - 1)]$.
describes the individual behavior of that rv (and gives rise to its name, as in Tables 1.1 and 1.2), but $\Omega$ and the above mappings are needed to describe how the rv's interact.
For a vector rv $\mathbf{X}$ with components $X_1, \ldots, X_n$, or a complex rv $X$ with real and imaginary parts $X_1, X_2$, the distribution function is also defined by (1.18). Note that $\{X_1 \le x_1, X_2 \le x_2, \ldots, X_n \le x_n\}$ is an event and the corresponding probability is non-decreasing in each argument $x_i$. Also the distribution function of any subset of random variables is obtained by setting the other arguments to $+\infty$. For example, the distribution of a single rv (called a marginal distribution for a given joint distribution) is given by
$$F_{X_i}(x_i) = F_{X_1 \cdots X_{i-1} X_i X_{i+1} \cdots X_n}(\infty, \ldots, \infty, x_i, \infty, \ldots, \infty).$$
If the rv's are all discrete, there is a joint PMF which specifies and is specified by the joint distribution function. It is given by
$$p_{X_1 \ldots X_n}(x_1, \ldots, x_n) = \Pr\{X_1 = x_1, \ldots, X_n = x_n\}.$$
Similarly, if the joint distribution function can be differentiated as below, then it specifies and is specified by the joint probability density,
$$f_{X_1 \ldots X_n}(x_1, \ldots, x_n) = \frac{\partial^n F_{X_1 \ldots X_n}(x_1, \ldots, x_n)}{\partial x_1\, \partial x_2 \cdots \partial x_n}.$$
Two rv's, say $X$ and $Y$, are statistically independent (or, more briefly, independent) if
$$F_{XY}(x, y) = F_X(x)F_Y(y) \qquad \text{for each } x \in \mathbb{R},\ y \in \mathbb{R}. \qquad (1.19)$$
If $X$ and $Y$ are discrete rv's, then the definition of independence in (1.19) is equivalent to the corresponding statement for PMF's,
$$p_{XY}(x_i, y_j) = p_X(x_i)\,p_Y(y_j) \qquad \text{for each value } x_i \text{ of } X \text{ and } y_j \text{ of } Y.$$
Since $\{X = x_i\}$ and $\{Y = y_j\}$ are events, the conditional probability of $\{X = x_i\}$ conditional on $\{Y = y_j\}$ (assuming $p_Y(y_j) > 0$) is given by (1.13) to be
$$p_{X|Y}(x_i \mid y_j) = \frac{p_{XY}(x_i, y_j)}{p_Y(y_j)}.$$
If $p_{X|Y}(x_i \mid y_j) = p_X(x_i)$ for all $i, j$, then it is seen that $X$ and $Y$ are independent. This captures the intuitive notion of independence better than (1.19) for discrete rv's, since it can be viewed as saying that the PMF of $X$ is not affected by the sample value of $Y$.
If $X$ and $Y$ have a joint density, then (1.19) is equivalent to
$$f_{XY}(x, y) = f_X(x)f_Y(y) \qquad \text{for each } x \in \mathbb{R},\ y \in \mathbb{R}. \qquad (1.20)$$
If the joint density exists and the marginal density $f_Y(y)$ is positive, the conditional density can be defined as $f_{X|Y}(x|y) = \frac{f_{XY}(x,y)}{f_Y(y)}$. In essence, $f_{X|Y}(x|y)$ is the density of $X$ conditional on $Y = y$, but, being more precise, it is a limiting conditional density as $\delta \to 0$ of $X$ conditional on $Y \in [y, y+\delta)$.
If $X$ and $Y$ have a joint density, then statistical independence can also be expressed as
$$f_{X|Y}(x|y) = f_X(x) \qquad \text{for each } x \in \mathbb{R},\ y \in \mathbb{R} \text{ such that } f_Y(y) > 0. \qquad (1.21)$$
This often captures the intuitive notion of statistical independence for continuous rv's better than (1.20).
More generally, the probability of an arbitrary event $A$, conditional on a given value of a continuous rv $Y$, is given by
$$\Pr\{A \mid Y = y\} = \lim_{\delta \to 0} \frac{\Pr\{A,\ Y \in [y, y+\delta]\}}{\Pr\{Y \in [y, y+\delta]\}}.$$
We next generalize the above results about two rv's to the case of $n$ rv's $\mathbf{X} = X_1, \ldots, X_n$. Statistical independence is then defined by the equation
$$F_{\mathbf{X}}(x_1, \ldots, x_n) = \prod_{i=1}^{n} \Pr\{X_i \le x_i\} = \prod_{i=1}^{n} F_{X_i}(x_i) \qquad \text{for all } x_1, \ldots, x_n \in \mathbb{R}. \qquad (1.22)$$
In other words, $X_1, \ldots, X_n$ are independent if the events $\{X_i \le x_i\}$ for $1 \le i \le n$ are independent for all choices of $x_1, \ldots, x_n$. If the density or PMF exists, (1.22) is equivalent to a product form for the density or mass function. A set of rv's is said to be pairwise independent if each pair of rv's in the set is independent. As shown in Exercise 1.23, pairwise independence does not imply that the entire set is independent.
Independent rv’s are very often also identically distributed, i.e., they all have the same
distribution function. These cases arise so often that we abbreviate independent identically
distributed by IID. For the IID case, (1.22) becomes
$$F_{\mathbf{X}}(x_1, \ldots, x_n) = \prod_{i=1}^{n} F_X(x_i).$$
1.3.5 Stochastic processes
A stochastic process (or random process$^{17}$) is an infinite collection of rv's defined on the same probability model. These rv's are usually indexed by an integer or a real number, often interpreted as time.$^{18}$
Thus each sample point of the probability model maps to an
inﬁnite collection of sample values of rv’s. If the index is regarded as time, then each sample
point maps to a function of time called a sample path or sample function. These sample
paths might vary continuously with time or might vary only at discrete times, and if they
vary at discrete times, those times might be deterministic or random.
In many cases, this collection of rv’s comprising the stochastic process is the only thing of
interest. In this case, the sample points of the probability model can be taken to be the
sample paths of the process. Conceptually, then, each event is a collection of sample paths.
Often the most important of these events can be deﬁned in terms of a ﬁnite set of rv’s.
As an example of sample paths that change at only discrete times, we might be concerned
with the times at which customers arrive at some facility. These ‘customers’ might be
customers entering a store, incoming jobs for a computer system, arriving packets to a
communication system, or orders for a merchandising warehouse.
The Bernoulli process is an example of how such customers could be modeled and is perhaps
the simplest non-trivial stochastic process. We deﬁne this process shortly and develop a
few of its many properties. We will return to it frequently, both to use it as an example
and to develop additional properties.
1.3.6 The Bernoulli process
A Bernoulli process is a sequence, $Z_1, Z_2, \ldots$, of IID binary random variables.$^{19}$ Let $p = \Pr\{Z_i = 1\}$ and $1 - p = \Pr\{Z_i = 0\}$. We often visualize a Bernoulli process as evolving in discrete time with the event $\{Z_i = 1\}$ representing an arriving customer at time $i$ and $\{Z_i = 0\}$ representing no arrival. Thus at most one arrival occurs at each integer time. We visualize the process as starting at time 0, with the first opportunity for an arrival at time 1.
When viewed as arrivals in time, it is interesting to understand something about the intervals
between successive arrivals and about the aggregate number of arrivals up to any given time
(see Figure 1.2). These interarrival times and aggregate numbers of arrivals are rv’s that
are functions of the underlying sequence $Z_1, Z_2, \ldots$. The topic of rv's that are defined as
$^{17}$Stochastic and random are synonyms, but random has become more popular for random variables and
stochastic for stochastic processes. The reason for the author’s choice is that the common-sense intuition
associated with randomness appears more important than mathematical precision in reasoning about rv’s,
whereas for stochastic processes, common-sense intuition causes confusion much more frequently than with
rv’s. The less familiar word stochastic warns the reader to be more careful.
$^{18}$This deﬁnition is deliberately vague, and the choice of whether to call a sequence of rv’s a process or a
sequence is a matter of custom and choice.
$^{19}$We say that a sequence $Z_1, Z_2, \ldots$ of rv's are IID if for each integer $n$, the rv's $Z_1, \ldots, Z_n$ are IID. There are some subtleties in going to the limit $n \to \infty$, but we can avoid most such subtleties by working with finite $n$-tuples and going to the limit at the end.
functions of other rv’s (i.e., whose sample values are functions of the sample values of the
other rv’s) is taken up in more generality in Section 1.3.8, but the interarrival times and
aggregate arrivals for Bernoulli processes are so specialized and simple that it is better to
treat them from ﬁrst principles.
First, consider the first interarrival time, $X_1$, which is defined as the time of the first arrival. If $Z_1 = 1$, then (and only then) $X_1 = 1$. Thus $p_{X_1}(1) = p$. Next, $X_1 = 2$ if and only if $Z_1 = 0$ and $Z_2 = 1$, so $p_{X_1}(2) = p(1-p)$. Continuing, we see that $X_1$ has the geometric PMF,
$$p_{X_1}(j) = p(1-p)^{j-1} \qquad \text{where } j \ge 1.$$
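A quick Monte Carlo check of the geometric PMF of $X_1$ (the parameter $p$, the seed, and the trial count are arbitrary choices for the sketch):

```python
import random

random.seed(1)
p, trials = 0.3, 200_000

def first_arrival():
    """Simulate Z_1, Z_2, ... until the first Z_t = 1; return that t."""
    t = 1
    while random.random() >= p:      # Z_t = 0 with probability 1 - p
        t += 1
    return t

counts = {}
for _ in range(trials):
    j = first_arrival()
    counts[j] = counts.get(j, 0) + 1

# Empirical frequencies should be close to the geometric PMF p(1-p)^{j-1}.
for j in range(1, 5):
    geometric = p * (1 - p) ** (j - 1)
    assert abs(counts.get(j, 0) / trials - geometric) < 0.01
```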
Figure 1.2: Illustration of a sample path for a Bernoulli process: The sample values of the binary rv's $Z_i$ are shown below the time instants. The sample value of the aggregate number of arrivals, $S_n = \sum_{i=1}^{n} Z_i$, is the illustrated step function, and the interarrival intervals are the intervals between steps. For the sample path shown, the $Z_i$ for $1 \le i \le 8$ are 0, 1, 1, 0, 0, 1, 0, 0, and the corresponding $S_i$ are 0, 1, 2, 2, 2, 3, 3, 3.
Each subsequent interarrival time $X_k$ can be found in this same way.$^{20}$ It has the same geometric PMF and is statistically independent of $X_1, \ldots, X_{k-1}$. Thus the sequence of interarrival times is an IID sequence of geometric rv's.

It can be seen from Figure 1.2 that a sample path of interarrival times also determines a sample path of the binary arrival rv's, $\{Z_i;\ i \ge 1\}$. Thus the Bernoulli process can also be characterized in terms of a sequence of IID geometric rv's.
For our present purposes, the most important rv's in a Bernoulli process are the partial sums $S_n = \sum_{i=1}^{n} Z_i$. Each rv $S_n$ is the number of arrivals up to and including time $n$, i.e., $S_n$ is simply the sum of $n$ binary IID rv's and thus has the binomial distribution. That is, $p_{S_n}(k)$ is the probability that $k$ out of $n$ of the $Z_i$'s have the value 1. There are $\binom{n}{k} = \frac{n!}{k!(n-k)!}$ arrangements of a binary $n$-tuple with $k$ 1's, and each has probability $p^k q^{n-k}$. Thus
$$p_{S_n}(k) = \binom{n}{k} p^k q^{n-k}. \qquad (1.23)$$
We will use the binomial PMF extensively as an example in explaining the laws of large
numbers later in this chapter, and will often use it in later chapters as an example of a sum
$^{20}$This is one of those maddening arguments that, while intuitively obvious, requires some careful reasoning
to be completely convincing. We go through several similar arguments with great care in Chapter 2, and
suggest that skeptical readers wait until then to prove this rigorously.
of IID rv's. For these examples, we need to know how $p_{S_n}(k)$ behaves asymptotically as $n \to \infty$ and $k \to \infty$. The relative frequency $k/n$ will be denoted as $\tilde{p}$. We make a short digression here to state and develop an approximation to the binomial PMF that makes this asymptotic behavior clear.
Theorem 1.3.1. Let $p_{S_n}(\tilde{p}n)$ be the PMF of the binomial distribution for an underlying binary PMF $p_Z(1) = p > 0$, $p_Z(0) = q > 0$. Then for each integer $\tilde{p}n$, $1 \le \tilde{p}n \le n-1$,
$$p_{S_n}(\tilde{p}n) \ <\ \sqrt{\frac{1}{2\pi n \tilde{p}(1-\tilde{p})}}\ \exp\bigl[-n\,c(p, \tilde{p})\bigr], \qquad \text{where} \qquad (1.24)$$
$$c(p, \tilde{p}) = -\tilde{p}\,\ln\Bigl(\frac{p}{\tilde{p}}\Bigr) - (1-\tilde{p})\,\ln\Bigl(\frac{1-p}{1-\tilde{p}}\Bigr) \ \ge\ 0. \qquad (1.25)$$
Also, $c(p, \tilde{p}) > 0$ for all $\tilde{p} \neq p$. Finally, for each $\tilde{p} \in (0,1)$, there is an $n(\tilde{p})$ such that
$$p_{S_n}(\tilde{p}n) \ >\ \Bigl(1 - \frac{1}{\sqrt{n}}\Bigr)\sqrt{\frac{1}{2\pi n \tilde{p}(1-\tilde{p})}}\ \exp\bigl[-n\,c(p, \tilde{p})\bigr] \qquad \text{for all } n \ge n(\tilde{p}). \qquad (1.26)$$
Discussion: The parameter $\tilde{p} = k/n$ is the relative frequency of 1's in the $n$-tuple $Z_1, \ldots, Z_n$. For each $n$, $\tilde{p}$ on the left of (1.24) is restricted so that $\tilde{p}n$ is an integer. The theorem then says that $p_{S_n}(\tilde{p}n)$ is upper bounded by an exponentially decreasing function of $n$ for each $\tilde{p} \neq p$. For $0 < \tilde{p} < 1$ the ratio of the upper and lower bounds on $p_{S_n}(\tilde{p}n)$ approaches 1 as $n \to \infty$. A bound that is asymptotically tight in this way is denoted as
$$p_{S_n}(\tilde{p}n) \ \sim\ \sqrt{\frac{1}{2\pi n \tilde{p}(1-\tilde{p})}}\ \exp\bigl[-n\,c(p, \tilde{p})\bigr] \qquad \text{for } \epsilon < \tilde{p} < 1-\epsilon, \qquad (1.27)$$
where the symbol $\sim$ means that the ratio of the left to the right side approaches 1 as $n \to \infty$.

The function $c(p, \tilde{p})$ is known as the Kullback-Leibler divergence between a binary rv of parameter $p$ and one of parameter $\tilde{p}$. It is a measure of how different $p$ and $\tilde{p}$ are and we will see the same quantity for arbitrary discrete rv's in Chapter 5.
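A numeric check, using the theorem's notation with arbitrary values of $p$ and $\tilde{p}$, that the upper bound (1.24) holds and becomes tight as $n$ grows, as claimed in (1.27):

```python
from math import comb, exp, log, pi, sqrt

def p_Sn(n, k, p):
    """Exact binomial PMF (1.23)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def c(p, pt):
    """Divergence c(p, ptilde) of (1.25)."""
    return -pt * log(p / pt) - (1 - pt) * log((1 - p) / (1 - pt))

def bound(n, p, pt):
    """Right side of (1.24)."""
    return sqrt(1 / (2 * pi * n * pt * (1 - pt))) * exp(-n * c(p, pt))

p, pt = 0.4, 0.5                      # ptilde chosen so pt * n is an integer
for n in (10, 100, 1000):
    k = int(pt * n)
    assert p_Sn(n, k, p) < bound(n, p, pt)          # upper bound (1.24)

ratio = p_Sn(1000, 500, p) / bound(1000, p, pt)
assert ratio > 0.99                                  # asymptotic tightness (1.27)
```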
Proof*:²¹ The factorial of any positive integer n is bounded by the Stirling bounds,²²

    √(2πn) (n/e)^n < n! < √(2πn) (n/e)^n e^{1/12n}.   (1.28)

The ratio √(2πn)(n/e)^n/n! is monotonically increasing with n toward the limit 1, and the ratio √(2πn)(n/e)^n exp(1/12n)/n! is monotonically decreasing toward 1. The upper bound is more accurate, but the lower bound is simpler and known as the Stirling approximation. Since √(2πn)(n/e)^n/n! is increasing in n, we see that n!/k! < √(n/k) n^n k^{−k} e^{−n+k} for k < n. Combining this with (1.28) applied to n−k,

    (n choose k) < √(n/(2πk(n−k))) · n^n/(k^k (n−k)^{n−k}).   (1.29)
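As a quick numerical check (my own sketch, not part of the text), the Stirling bounds in (1.28) and the monotonicity of the two ratios can be verified directly:

```python
import math

def stirling_lower(n):
    # sqrt(2 pi n) (n/e)^n, the lower bound in (1.28) (the Stirling approximation)
    return math.sqrt(2 * math.pi * n) * (n / math.e) ** n

def stirling_upper(n):
    # the upper bound in (1.28): the lower bound times e^(1/12n)
    return stirling_lower(n) * math.exp(1.0 / (12 * n))

for n in [1, 2, 5, 10, 20]:
    assert stirling_lower(n) < math.factorial(n) < stirling_upper(n)

# the ratio sqrt(2 pi n)(n/e)^n / n! increases toward 1, and the ratio with
# the extra exp(1/12n) factor decreases toward 1, as stated in the text
lo = [stirling_lower(n) / math.factorial(n) for n in range(1, 21)]
hi = [stirling_upper(n) / math.factorial(n) for n in range(1, 21)]
assert all(a < b < 1 for a, b in zip(lo, lo[1:]))
assert all(a > b > 1 for a, b in zip(hi, hi[1:]))
```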
²¹ Proofs with an asterisk can be omitted without an essential loss of continuity.

²² See Feller [7] for a derivation of these results about the Stirling bounds. Feller also shows that an improved lower bound to n! is given by √(2πn)(n/e)^n exp[1/(12n) − 1/(360n³)].

Similarly, lower bounding n! by the left side of (1.28) and upper bounding k! and (n−k)! by the right side,

    (n choose k) > √(n/(2πk(n−k))) · n^n/(k^k (n−k)^{n−k}) · exp[−1/(12k) − 1/(12(n−k))]
                 ≥ √(n/(2πk(n−k))) · n^n/(k^k (n−k)^{n−k}) · [1 − 1/(12n p̃(1−p̃))],   (1.30)

where the last step uses exp(−x) ≥ 1−x together with the fact that, for k = p̃n, 1/(12k) + 1/(12(n−k)) = 1/(12n p̃(1−p̃)). Multiplying (1.29) and (1.30) by p^k(1−p)^{n−k} and noting that, with k = p̃n, n^n p^k (1−p)^{n−k}/(k^k(n−k)^{n−k}) = exp[−n c(p, p̃)], we obtain (1.24) along with a corresponding lower bound on p_{S_n}(p̃n). For ε < p̃ < 1−ε, the term in brackets in (1.30) is lower bounded by 1 − 1/(12nε(1−ε)), which is further lower bounded by 1 − 1/√n for all sufficiently large n, establishing (1.26).
Finally, to show that c(p, p̃) ≥ 0, with strict inequality for p̃ ≠ p, we take the first two derivatives of c(p, p̃) with respect to p̃:

    ∂c(p, p̃)/∂p̃ = −ln( p(1−p̃)/(p̃(1−p)) );     ∂²c(p, p̃)/∂p̃² = 1/(p̃(1−p̃)).

Since the second derivative is positive for 0 < p̃ < 1, the minimum of c(p, p̃) with respect to p̃ is 0, achieved at p̃ = p. Thus c(p, p̃) > 0 for p̃ ≠ p. Furthermore, c(p, p̃) increases as p̃ moves in either direction away from p.
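The content of the theorem can also be checked numerically. The sketch below (my own, not from the text) evaluates the exact binomial PMF against the upper bound of (1.24) and watches their ratio approach 1, as (1.27) asserts:

```python
import math

def c(p, pt):
    # the Kullback-Leibler divergence of (1.25)
    return -pt * math.log(p / pt) - (1 - pt) * math.log((1 - p) / (1 - pt))

def binom_pmf(n, k, p):
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

def upper_bound(n, pt, p):
    # right-hand side of (1.24)
    return math.sqrt(1 / (2 * math.pi * n * pt * (1 - pt))) * math.exp(-n * c(p, pt))

p, pt = 0.3, 0.5                 # relative frequency pt differs from p
assert c(p, pt) > 0              # strict positivity for pt != p

ratios = []
for n in [10, 100, 1000]:
    exact = binom_pmf(n, int(pt * n), p)
    assert exact < upper_bound(n, pt, p)      # the bound (1.24)
    ratios.append(exact / upper_bound(n, pt, p))

# the ratio of the PMF to the bound approaches 1: the tightness claimed in (1.27)
assert ratios[0] < ratios[1] < ratios[2] < 1
assert ratios[2] > 0.999
```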
Various aspects of this theorem will be discussed later with respect to each of the laws of
large numbers.
We saw earlier that the Bernoulli process can also be characterized as a sequence of IID
geometric interarrival intervals. An interesting generalization of this arises by allowing the
interarrival intervals to be arbitrary discrete or continuous nonnegative IID rv’s rather than
geometric rv’s. These processes are known as renewal processes and are the topic of Chapter
5. Poisson processes are special cases of renewal processes in which the interarrival intervals
have an exponential PDF. These are treated in Chapter 2 and have many connections to
Bernoulli processes.
Renewal processes are examples of discrete stochastic processes. The distinguishing charac-
teristic of such processes is that interesting things (arrivals, departures, changes of state)
occur at discrete instants of time separated by deterministic or random intervals. Discrete
stochastic processes are to be distinguished from noise-like stochastic processes in which
changes are continuously occurring and the sample paths are continuously varying func-
tions of time. The description of discrete stochastic processes above is not intended to be
precise, but Chapters 2, 4, and 5 are restricted to discrete stochastic processes in this sense,
whereas Chapter 3 is restricted to continuous processes.
1.3.7 Expectation
The expected value E[X] of a random variable X is also called the expectation or the mean and is frequently denoted as X̄. Before giving a general definition, we discuss several special cases. First consider nonnegative discrete rv's. The expected value E[X] is then given by

    E[X] = Σ_x x p_X(x).   (1.31)
If X has a ﬁnite number of possible sample values, the above sum must be ﬁnite since each
sample value must be ﬁnite. On the other hand, if X has a countable number of nonnegative
sample values, the sum in (1.31) might be either ﬁnite or inﬁnite. Example 1.3.1 illustrates
a case in which the sum is inﬁnite. The expectation is said to exist only if the sum is ﬁnite
(i.e., if the sum converges to a real number), and in this case E[X] is given by (1.31). If the
sum is infinite, we say that E[X] does not exist, but also say²³ that E[X] = ∞. In other words, (1.31) can be used in both cases, but E[X] is said to exist only if the sum is finite.
Example 1.3.1. This example will be useful frequently in illustrating rv's that have an infinite expectation. Let N be a positive integer-valued rv with the distribution function F_N(n) = n/(n+1) for each integer n ≥ 1. Then N is clearly a positive rv since F_N(0) = 0 and lim_{n→∞} F_N(n) = 1. For each n ≥ 1, the PMF is given by

    p_N(n) = F_N(n) − F_N(n−1) = n/(n+1) − (n−1)/n = 1/(n(n+1)).   (1.32)
Since p_N(n) is a PMF, we see that Σ_{n=1}^∞ 1/[n(n+1)] = 1, which is a frequently useful fact. The following equation, however, shows that E[N] does not exist and has infinite value.

    E[N] = Σ_{n=1}^∞ n p_N(n) = Σ_{n=1}^∞ n/(n(n+1)) = Σ_{n=1}^∞ 1/(n+1) = ∞,

where we have used the fact that the harmonic series diverges.
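Both facts, that the PMF in (1.32) sums to 1 and that the partial sums defining E[N] grow without bound, are easy to check numerically (a sketch, not part of the text):

```python
# p_N(n) = 1/(n(n+1)) from (1.32): a valid PMF whose mean diverges
total = sum(1.0 / (n * (n + 1)) for n in range(1, 10**6))
assert abs(total - 1.0) < 1e-5          # telescoping sum equals 1 - 1/10**6

def partial_mean(N):
    # partial sums of E[N] = sum of n * p_N(n) = sum of 1/(n+1)
    return sum(n / (n * (n + 1.0)) for n in range(1, N + 1))

m2, m4, m6 = partial_mean(10**2), partial_mean(10**4), partial_mean(10**6)
assert m2 < m4 < m6     # the partial sums keep growing (harmonic series)
assert m6 > 10          # growth is logarithmic but unbounded
```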
We next derive an alternative expression for the expected value of a nonnegative discrete rv.
This new expression is given directly in terms of the distribution function. We then use this
new expression as a general deﬁnition of expectation which applies to all nonnegative rv’s,
whether discrete, continuous, or arbitrary. It contains none of the convergence questions
that could cause confusion for arbitrary rv’s or for continuous rv’s with wild densities.
For a nonnegative discrete rv X, Figure 1.3 illustrates that (1.31) is simply the integral of the complementary distribution function, where the complementary distribution function F^c of a rv is defined as F^c_X(x) = Pr{X > x} = 1 − F_X(x):

    E[X] = ∫_0^∞ F^c_X(x) dx = ∫_0^∞ Pr{X > x} dx.   (1.33)
Although Figure 1.3 only illustrates the equality of (1.31) and (1.33) for one special case, one
easily sees that the argument applies to any nonnegative discrete rv, including those with
countably many values, by equating the sum of the indicated rectangles with the integral.
²³ It seems metaphysical to say that something doesn't exist but has infinite value. However, the word exist here is shorthand for exist as a real number, which makes it quite reasonable to also consider the value in the extended real number system, which includes ±∞.
1.3. PROBABILITY REVIEW 21
[Figure 1.3 appears here.]

Figure 1.3: The figure shows the complementary distribution function F^c_X of a nonnegative discrete rv X. For this example, X takes on five possible values, 0 = a_0 < a_1 < a_2 < a_3 < a_4. Thus F^c_X(x) = Pr{X > x} = 1 − p_X(a_0) for x < a_1. For a_1 ≤ x < a_2, Pr{X > x} = 1 − p_X(a_0) − p_X(a_1), and Pr{X > x} has similar drops as x reaches a_2, a_3, and a_4. E[X], from (1.31), is Σ_i a_i p_X(a_i), which is the sum of the rectangles in the figure. This is also the area under the curve F^c_X(x), i.e., ∫_0^∞ F^c_X(x) dx. It can be seen that this argument applies to any nonnegative rv, thus verifying (1.33).
For a nonnegative integer-valued rv X, (1.33) reduces to a simpler form that is often convenient when X has a countable set of sample values:

    E[X] = Σ_{n=0}^∞ Pr{X > n} = Σ_{n=1}^∞ Pr{X ≥ n}.   (1.34)
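As a concrete check of (1.31) and (1.34) (a sketch, not from the text), take X geometric with p_X(n) = 2^{−n} for n ≥ 1, so that Pr{X > n} = 2^{−n} and E[X] = 2; both expressions give the same value:

```python
# geometric rv with p_X(n) = 2^(-n) for n >= 1, so Pr{X > n} = 2^(-n), E[X] = 2
N = 200                 # truncation point; the remaining tail is negligible

mean_direct = sum(n * 0.5**n for n in range(1, N))    # definition (1.31)
mean_tails = sum(0.5**n for n in range(0, N))         # tail form (1.34)

assert abs(mean_direct - 2.0) < 1e-12
assert abs(mean_tails - 2.0) < 1e-12
```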
For an arbitrary nonnegative rv X, we can visualize quantizing the rv, finding the expected value of the quantized rv, and then going to the limit of arbitrarily fine quantizations. Each quantized rv is discrete, so its expectation is given by (1.33) applied to the quantized rv. Each such expectation can be viewed as a Riemann sum for the integral ∫_0^∞ F^c_X(x) dx of the original rv.
There are no mathematical subtleties in integrating an arbitrary nonnegative non-increasing function, so ∫_0^∞ F^c_X(x) dx must have either a finite or infinite limit. This leads us to the following fundamental definition of expectation for arbitrary nonnegative rv's:

Definition 1.3.6. The expectation E[X] of a nonnegative rv X is defined by (1.33). The expectation is said to exist if and only if the integral is finite. Otherwise the expectation is said to not exist and is also said to be infinite.
Exercise 1.7 shows that this definition is consistent with the conventional definition of expectation for the case of continuous rv's, i.e.,

    E[X] = lim_{b→∞} ∫_0^b x f_X(x) dx.   (1.35)

This can also be seen using integration by parts.
Next consider rv's with both positive and negative sample values. If X has a finite number of positive and negative sample values, say a_1, a_2, ..., a_n, the expectation E[X] is given by

    E[X] = Σ_i a_i p_X(a_i)
         = Σ_{a_i ≤ 0} a_i p_X(a_i) + Σ_{a_i > 0} a_i p_X(a_i).   (1.36)
If X has a countably inﬁnite set of sample values, then (1.36) can still be used if each of
the sums in (1.36) converges to a ﬁnite value, and otherwise the expectation does not exist
(as a real number). It can be seen that each sum in (1.36) converges to a ﬁnite value if and
only if E[|X|] exists (i.e., converges to a ﬁnite value) for the nonnegative rv |X|.
If E[X] does not exist (as a real number), it still might have the value ∞ if the first sum converges and the second does not, or the value −∞ if the second sum converges and the first does not. If both sums diverge, then E[X] is undefined, even as ±∞. In this latter case, the partial sums can be arbitrarily small or large depending on the order in which the terms of (1.36) are summed (see Exercise 1.9).
As illustrated for a finite number of sample values in Figure 1.4, the expression in (1.36) can also be expressed directly in terms of the distribution function and complementary distribution function as

    E[X] = −∫_{−∞}^0 F_X(x) dx + ∫_0^∞ F^c_X(x) dx.   (1.37)

Since F^c_X(x) = 1 − F_X(x), this can also be expressed as

    E[X] = ∫_{−∞}^∞ [u(x) − F_X(x)] dx,

where u(x) is the unit step, u(x) = 1 for x ≥ 0 and u(x) = 0 otherwise.
[Figure 1.4 appears here.]

Figure 1.4: For this example, X takes on four possible sample values, a_1 < a_2 < 0 < a_3 < a_4. The figure plots F_X(x) for x ≤ 0 and F^c_X(x) for x > 0. As in Figure 1.3, ∫_{x≥0} F^c_X(x) dx = a_3 p_X(a_3) + a_4 p_X(a_4). Similarly, ∫_{x<0} F_X(x) dx = −a_1 p_X(a_1) − a_2 p_X(a_2).
The ﬁrst integral in (1.37) corresponds to the negative sample values and the second to the
positive sample values, and E[X] exists if and only if both integrals are ﬁnite (i.e., if E[|X|]
is ﬁnite).
For continuous-valued rv's with positive and negative sample values, the conventional definition of expectation (assuming that E[|X|] exists) is given by

    E[X] = ∫_{−∞}^∞ x f_X(x) dx.   (1.38)
This is equal to (1.37) by the same argument as with nonnegative rv’s. Also, as with non-
negative rv’s, (1.37) also applies to arbitrary rv’s. We thus have the following fundamental
deﬁnition of expectation:
Definition 1.3.7. The expectation E[X] of a rv X exists, with the value given in (1.37), if each of the two terms in (1.37) is finite. The expectation does not exist, but has value ∞ (−∞), if the first term is finite (infinite) and the second infinite (finite). The expectation does not exist and is undefined if both terms are infinite.
We should not view the general expression in (1.37) for expectation as replacing the need
for the conventional expressions in (1.38) and (1.36). We will use all of these expressions
frequently, using whichever is most convenient. The main advantages of (1.37) are that
it applies equally to all rv’s, it poses no questions about convergence, and it is frequently
useful, especially in limiting arguments.
Example 1.3.2. The Cauchy rv X is the classic example of a rv whose expectation does not exist and is undefined. The probability density is f_X(x) = 1/(π(1+x²)). Thus x f_X(x) is proportional to 1/x both as x → ∞ and as x → −∞. It follows that ∫_0^∞ x f_X(x) dx and ∫_{−∞}^0 x f_X(x) dx are both infinite. On the other hand, we see from symmetry that the Cauchy principal value of the integral in (1.38) is given by

    lim_{A→∞} ∫_{−A}^A x/(π(1+x²)) dx = 0.
There is usually little motivation for considering the upper and lower limits of the integration
to have the same magnitude, and the Cauchy principal value usually has little signiﬁcance
for expectations.
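The behavior in Example 1.3.2 can be seen numerically (a sketch, not from the text): the one-sided integral ∫_0^A x f_X(x) dx = ln(1+A²)/(2π) grows without bound in A, while the symmetric integral defining the Cauchy principal value stays at 0:

```python
import math

def one_sided(A):
    # closed form of the integral of x/(pi (1 + x^2)) over [0, A]
    return math.log(1 + A * A) / (2 * math.pi)

# the positive-side contribution grows without bound (logarithmically in A)
assert one_sided(1e3) < one_sided(1e6) < one_sided(1e9)

def symmetric(A, steps=10**5):
    # midpoint-rule approximation of the integral of x/(pi (1 + x^2)) over [-A, A]
    h = 2 * A / steps
    return sum((-A + (i + 0.5) * h) / (math.pi * (1 + (-A + (i + 0.5) * h) ** 2)) * h
               for i in range(steps))

# the odd integrand cancels over symmetric limits: the Cauchy principal value is 0
assert abs(symmetric(100.0)) < 1e-9
```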
1.3.8 Random variables as functions of other random variables
Random variables (rv's) are often defined in terms of each other. For example, if h is a function from R to R and X is a rv, then Y = h(X) is the random variable that maps each sample point ω to the composite function h(X(ω)). The distribution function of Y can be found from this, and the expected value of Y can then be evaluated by (1.37).
It is often more convenient to find E[Y] directly using the distribution function of X. Exercise 1.19 indicates that E[Y] is given by ∫ h(x) f_X(x) dx for continuous rv's and by Σ_x h(x) p_X(x) for discrete rv's. In order to avoid continuing to use separate expressions for continuous and discrete rv's, we express both of these relations by

    E[Y] = ∫_{−∞}^∞ h(x) dF_X(x).   (1.39)

This is known as a Stieltjes integral, which can be used as a generalization of both the continuous and discrete cases. For most purposes, we use Stieltjes integrals²⁴ as a notational shorthand for either ∫ h(x) f_X(x) dx or Σ_x h(x) p_X(x).
²⁴ More specifically, the Riemann-Stieltjes integral, abbreviated here as the Stieltjes integral, is denoted as
The existence of E[X] does not guarantee the existence of E[Y ], but we will treat the
question of existence as it arises rather than attempting to establish any general rules.
Particularly important examples of such expected values are the moments E[X^n] of a rv X and the central moments E[(X − X̄)^n] of X, where X̄ is the mean E[X]. The second central moment is called the variance, denoted by σ²_X or VAR[X]. It is given by

    σ²_X = E[(X − X̄)²] = E[X²] − X̄².   (1.40)
The standard deviation σ_X of X is the square root of the variance and provides a measure of dispersion of the rv around the mean. Thus the mean is often viewed as a 'typical value' for the outcome of the rv (see Section 1.3.10) and σ_X is similarly viewed as a typical difference between X and X̄. An important connection between the mean and standard deviation is that E[(X − x)²] is minimized over x by choosing x to be E[X] (see Exercise 1.24).
Next suppose X and Y are rv's and consider the rv²⁵ Z = X + Y. If we assume that X and Y are independent, then the distribution function of Z = X + Y is given by²⁶

    F_Z(z) = ∫_{−∞}^∞ F_X(z−y) dF_Y(y) = ∫_{−∞}^∞ F_Y(z−x) dF_X(x).   (1.41)
If X and Y both have densities, this can be rewritten as

    f_Z(z) = ∫_{−∞}^∞ f_X(z−y) f_Y(y) dy = ∫_{−∞}^∞ f_Y(z−x) f_X(x) dx.   (1.42)
Eq. (1.42) is the familiar convolution equation from linear systems, and we similarly refer to (1.41) as the convolution of distribution functions (although it has a different functional form from (1.42)). If X and Y are nonnegative random variables, then the integrands in (1.41) and (1.42) are non-zero only between 0 and z, so we often use 0 and z as the limits in (1.41) and (1.42).
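For discrete rv's the integrals in (1.42) become sums over PMFs. A minimal sketch (function names are my own) convolving two fair-die PMFs:

```python
def convolve_pmfs(p, q):
    # discrete analog of (1.42): p_Z(z) = sum over y of p_X(z - y) p_Y(y);
    # PMFs are represented as dicts from sample values to probabilities
    r = {}
    for x, px in p.items():
        for y, py in q.items():
            r[x + y] = r.get(x + y, 0.0) + px * py
    return r

die = {k: 1.0 / 6 for k in range(1, 7)}      # fair six-sided die
s2 = convolve_pmfs(die, die)                  # PMF of the sum of two dice

assert abs(sum(s2.values()) - 1.0) < 1e-12    # the result is still a PMF
assert abs(s2[7] - 6.0 / 36) < 1e-12          # 7 is the most likely sum
assert abs(s2[2] - 1.0 / 36) < 1e-12

# the order of convolving does not matter
a, b = convolve_pmfs(die, s2), convolve_pmfs(s2, die)
assert set(a) == set(b) and all(abs(a[z] - b[z]) < 1e-12 for z in a)
```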
If X_1, X_2, ..., X_n are independent rv's, then the distribution of the rv S_n = X_1 + X_2 + ··· + X_n can be found by first convolving the distributions of X_1 and X_2 to get the distribution of S_2 and then, for each n ≥ 2, convolving the distribution of S_n and X_{n+1} to get the distribution of S_{n+1}. The distributions can be convolved in any order to get the same resulting distribution.
∫_a^b h(x) dF_X(x). This integral is defined as the limit of a generalized Riemann sum, lim_{δ→0} Σ_n h(x_n)[F(y_n) − F(y_{n−1})], where {y_n; n ≥ 1} is a sequence of increasing numbers from a to b satisfying y_n − y_{n−1} ≤ δ and y_{n−1} < x_n ≤ y_n for all n. The Stieltjes integral is defined to exist over finite limits if the limit exists and is independent of the choices of {y_n} and {x_n} as δ → 0. It exists over infinite limits if it exists over finite limits and a limit over the integration limits can be taken. See Rudin [20] for an excellent elementary treatment of Stieltjes integration, and see Exercise 1.15 for some examples.
²⁵ The question whether a real-valued function of a rv is itself a rv is usually addressed by the use of measure theory, and since we neither use nor develop measure theory in this text, we usually simply assume (within the limits of common sense) that any such function is itself a rv. However, the sum X + Y of rv's is so important that Exercise 1.13 provides a guided derivation of this result for X + Y. In the same way, the sum S_n = X_1 + ··· + X_n of any finite collection of rv's is also a rv.

²⁶ See Exercise 1.15 for some peculiarities about this definition.
Whether or not X_1, X_2, ..., X_n are independent, the expected value of S_n = X_1 + X_2 + ··· + X_n satisfies

    E[S_n] = E[X_1 + X_2 + ··· + X_n] = E[X_1] + E[X_2] + ··· + E[X_n].   (1.43)
This says that the expected value of a sum is equal to the sum of the expected values, whether or not the rv's are independent (see Exercise 1.14). The following example shows how this can be a valuable problem-solving aid with an appropriate choice of rv's.
Example 1.3.3. Consider a switch with n input nodes and n output nodes. Suppose each input is randomly connected to a single output in such a way that each output is also connected to a single input. That is, each output is connected to input 1 with probability 1/n. Given this connection, each of the remaining outputs is connected to input 2 with probability 1/(n−1), and so forth.
An input node is said to be matched if it is connected to the output of the same number.
We want to show that the expected number of matches (for any given n) is 1. Note that
the ﬁrst node is matched with probability 1/n, and therefore the expectation of a match
for node 1 is 1/n. Whether or not the second input node is matched depends on the choice
of output for the ﬁrst input node, but it can be seen from symmetry that the marginal
distribution for the output node connected to input 2 is 1/n for each output. Thus the
expectation of a match for node 2 is also 1/n. In the same way, the expectation of a match
for each input node is 1/n. From (1.43), the expected total number of matches is the sum
over the expected number for each input, and is thus equal to 1. This exercise would be much more difficult without the use of (1.43).
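A short simulation (my own sketch, not from the text) agrees with this conclusion: a uniformly random matching of n inputs to n outputs is a random permutation, and the average number of matches is close to 1 for every n:

```python
import random

def count_matches(n, rng):
    # a uniformly random connection of n inputs to n outputs is a random permutation
    perm = list(range(n))
    rng.shuffle(perm)
    return sum(1 for i in range(n) if perm[i] == i)

rng = random.Random(1)        # fixed seed for reproducibility
trials = 20000
for n in [3, 10, 50]:
    avg = sum(count_matches(n, rng) for _ in range(trials)) / trials
    assert abs(avg - 1.0) < 0.06      # E[matches] = n * (1/n) = 1 by (1.43)
```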
If the rv's X_1, ..., X_n are independent, then, as shown in Exercises 1.14 and 1.21, the variance of S_n = X_1 + ··· + X_n is given by

    σ²_{S_n} = Σ_{i=1}^n σ²_{X_i}.   (1.44)
If X_1, ..., X_n are also identically distributed (i.e., X_1, ..., X_n are IID) with variance σ²_X, then σ²_{S_n} = n σ²_X. Thus the standard deviation of S_n is σ_{S_n} = √n σ_X. Sums of IID rv's appear everywhere in probability theory and play an especially central role in the laws of large numbers. It is important to remember that the mean of S_n is linear in n but the standard deviation increases only with the square root of n. Figure 1.5 illustrates this behavior.
1.3.9 Conditional expectations

Just as the conditional distribution of one rv conditioned on a sample value of another rv is important, the conditional expectation of one rv based on the sample value of another is equally important. Initially let X be a positive discrete rv and let y be a sample value of another discrete rv Y such that p_Y(y) > 0. Then the conditional expectation of X given Y = y is defined to be

    E[X | Y=y] = Σ_x x p_{X|Y}(x|y).   (1.45)

The conditional expectation E[X | Y], viewed as a rv that takes the value E[X | Y=y] when Y = y, then satisfies²⁷

    E[X] = E[E[X | Y]].   (1.47)

As an example, let X_1 and X_2 be IID rv's, each uniformly distributed on the integers 1 to 6, and let S = X_1 + X_2. Given S = j, symmetry between X_1 and X_2 shows that E[X_1 | S=j] = j/2, so

    E[E[X_1 | S]] = Σ_{j=2}^{12} (j/2) p_S(j) = E[S]/2 = 7/2.
This example is not intended to show the value of (1.47) in calculating expectation, since E[X_1] = 7/2 is initially obvious from the uniform integer distribution of X_1. The purpose is simply to illustrate what the rv E[X_1 | S] means.
To illustrate (1.47) in a more general way, while still assuming X to be discrete, we can write out this expectation by using (1.45) for E[X | Y = y]:

    E[X] = E[E[X | Y]] = Σ_y p_Y(y) E[X | Y = y]
         = Σ_y p_Y(y) Σ_x x p_{X|Y}(x|y).   (1.48)
Operationally, there is nothing very fancy in the example or in (1.47). Combining the sums, (1.48) simply says that E[X] = Σ_{y,x} x p_{YX}(y, x). As a concept, however, viewing the
conditional expectation E[X | Y ] as a rv based on the conditioning rv Y is often a useful
theoretical tool. This approach is equally useful as a tool in problem solving, since there
are many problems where it is easy to ﬁnd conditional expectations, and then to ﬁnd the
total expectation by averaging over the conditioning variable. For this reason, this result is
sometimes called either the total expectation theorem or the iterated expectation theorem.
Exercise 1.20 illustrates the advantages of this approach, particularly where it is initially
clear that the expectation is ﬁnite. The following cautionary example, however, shows that
this approach can sometimes hide convergence questions and give the wrong answer.
²⁷ This assumes that E[X | Y = y] is finite for each y, which is one of the reasons that expectations are said to exist only if they are finite.
Example 1.3.5. Let Y be a geometric rv with the PMF p_Y(y) = 2^{−y} for integer y ≥ 1. Let X be an integer rv that, conditional on Y, is binary with equiprobable values ±2^y given Y = y. We then see that E[X | Y = y] = 0 for all y, and thus (1.48) indicates that E[X] = 0. On the other hand, it is easy to see that p_X(2^k) = p_X(−2^k) = 2^{−k−1} for each integer k ≥ 1. Thus the expectation over positive values of X is ∞ and that over negative values is −∞. In other words, the expected value of X is undefined and (1.48) is incorrect.
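The pathology in this example is easy to verify numerically (a sketch, not from the text): every conditional expectation is exactly 0, yet the sum over positive values of X alone grows without bound:

```python
# p_Y(y) = 2^(-y) for y >= 1; given Y = y, X is +-2^y with probability 1/2 each
K = 50

# the unconditional PMF: p_X(2^k) = p_X(-2^k) = Pr{Y = k}/2 = 2^(-k-1)
for k in range(1, K):
    assert 0.5 * 2.0**-k == 2.0 ** (-k - 1)

# every conditional expectation is exactly 0 ...
for y in range(1, K):
    assert 0.5 * 2.0**y + 0.5 * (-(2.0**y)) == 0.0

# ... yet the sum over positive values alone adds 1/2 for every k,
# so it grows without bound as K increases
pos_part = sum(2.0**k * 2.0 ** (-k - 1) for k in range(1, K))
assert pos_part == (K - 1) * 0.5
```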
The difficulty in the above example cannot occur if X is a nonnegative rv. Then (1.48) is simply a sum of a countable number of nonnegative terms, and thus it either converges to a finite sum independent of the order of summation, or it diverges to ∞, again independent of the order of summation.

If X has both positive and negative components, we can separate it into X = X⁺ + X⁻ where X⁺ = max(0, X) and X⁻ = min(X, 0). Then (1.48) applies to X⁺ and X⁻ separately. If at most one is infinite, then (1.48) applies to X, and otherwise E[X] is undefined. This is summarized in the following theorem:
Theorem 1.3.2 (Total expectation). Let X and Y be discrete rv's. If X is nonnegative, then E[X] = E[E[X | Y]] = Σ_y p_Y(y) E[X | Y = y]. If X has both positive and negative values, and if at most one of E[X⁺] and E[X⁻] is infinite, then E[X] = E[E[X | Y]] = Σ_y p_Y(y) E[X | Y = y].
We have seen above that if Y is a discrete rv, then the conditional expectation E[X | Y = y] is only a little more complicated than the unconditional expectation, and this is true whether X is discrete, continuous, or arbitrary. If X and Y are continuous, we can essentially extend these results to probability densities. In particular, defining E[X | Y = y] as

    E[X | Y = y] = ∫_{−∞}^∞ x f_{X|Y}(x | y) dx,   (1.49)
we have

    E[X] = ∫_{−∞}^∞ f_Y(y) E[X | Y = y] dy = ∫_{−∞}^∞ f_Y(y) ∫_{−∞}^∞ x f_{X|Y}(x | y) dx dy.   (1.50)
We do not state this as a theorem because the details about the integration do not seem
necessary for the places where it is useful.
1.3.10 Typical values of rv’s; mean and median
The distribution function of a rv often has more detail than we are interested in, and the
mean is often taken as a ‘typical value.’ Similarly, in statistics, the average of a set of
numerical data values is often taken to be representative of the entire set. For example,
students always want to know the average of the scores on an exam, and investors always
want to know the Dow-Jones average. Economists are also interested, for example, in such
averages as the average annual household income over various geographical regions. These
averages often take on an importance and a life of their own, particularly in terms of how
they vary in time.
The median of a rv (or set of data values) is often an alternate choice of a single number to serve as a typical value. We say that α is a median of X if Pr{X ≤ α} ≥ 1/2 and Pr{X ≥ α} ≥ 1/2. It is possible for the median to be non-unique, with all values in an interval satisfying the definition. Exercise 1.10 illustrates what this definition means. In addition, Exercise 1.11 shows that if the mean exists, then the median is an x that minimizes E[|X − x|].

Another interesting property of the median, suggested in Exercise 1.36, is that in essence a median of a large number of IID sample values of a rv is close to a median of the distribution with high probability. Another property, relating the median α to the mean X̄ of a rv with standard deviation σ, is (see Exercise 1.35)

    |X̄ − α| ≤ σ.   (1.51)
The question now arises whether the mean or the median is preferable as a single number
describing a rv. The question is too vague to be answered in any generality, but the answer
depends heavily on what the single number is to be used for. To illustrate this, consider
a rv whose sample values are the yearly household incomes of a large society (or, almost
equivalently, consider a large data set consisting of these yearly household incomes).
For the mean, the probability of each sample value is weighted by the household income, so that a household income of $10⁹ is weighted the same as 100,000 household incomes of $10⁴ each. For the median, this weighting disappears, and if our billionaire has a truly awful year with only $10⁶ income, the median is unchanged. If one is interested in the total purchasing power of the society, then the mean might be the more appropriate value. On the other hand, if one is interested in the well-being of the society, the median is the more appropriate value.²⁸
1.3.11 Indicator random variables

For any event A, the indicator random variable of A, denoted I_A, is a binary rv that has the value 1 for all ω ∈ A and the value 0 otherwise. It then has the PMF p_{I_A}(1) = Pr{A} and p_{I_A}(0) = 1 − Pr{A}. The corresponding distribution function F_{I_A} is then illustrated in Figure 1.6. It is easily seen that E[I_A] = Pr{A}.

[Figure 1.6 appears here, showing F_{I_A} rising from 0 to 1 − Pr{A} at 0 and to 1 at 1.]

Figure 1.6: The distribution function F_{I_A} of an indicator random variable I_A.
Indicator rv’s are useful because they allow us to apply the many known results about
rv’s and particularly binary rv’s to events. For example, the laws of large numbers are
²⁸ Unfortunately, the choice between median and mean (and many similar choices) is often made for commercial or political expediency rather than scientific or common-sense appropriateness.
expressed in terms of sums of rv's. If these rv's are taken to be the indicator functions for the occurrences of an event over successive trials, then the law of large numbers applies to the relative frequency of that event.
1.3.12 Moment generating functions and other transforms

The moment generating function (MGF) for a rv X is given by

    g_X(r) = E[e^{rX}] = ∫_{−∞}^∞ e^{rx} dF_X(x),   (1.52)

where r is a real variable. The integrand is nonnegative, and we can study where the integral exists (i.e., where it is finite) by separating it as follows:

    g_X(r) = ∫_0^∞ e^{rx} dF_X(x) + ∫_{−∞}^0 e^{rx} dF_X(x).   (1.53)
Both integrals exist for r = 0, since the first is Pr{X > 0} and the second is Pr{X ≤ 0}. The first integral is increasing in r, and thus if it exists for one value of r, it also exists for all smaller values. For example, if X is a nonnegative exponential rv with the density f_X(x) = e^{−x}, then the first integral exists if and only if r < 1, and it then has the value 1/(1−r). As another example, if X satisfies Pr{X > A} = 0 for some finite A, then the first integral is at most e^{rA}, which is finite for all real r.
Let r₊(X) be the supremum of values of r for which the first integral exists. Then 0 ≤ r₊(X) ≤ ∞ and the first integral exists for all r < r₊(X). In the same way, let r₋(X) be the infimum of values of r for which the second integral exists. Then −∞ ≤ r₋(X) ≤ 0 and the second integral exists for all r > r₋(X).

Combining the two integrals, the region of r over which g_X(r) exists is an interval I(X) from r₋(X) ≤ 0 to r₊(X) ≥ 0. Either or both of the end points, r₋(X) and r₊(X), might be included in I(X), and either or both might be either 0 or infinite. We denote these quantities as I, r₋, and r₊ when the rv X is clear from the context. Tables 1.3.3 and 1.3.3 give the interval I for a number of standard rv's, and Exercise 1.25 illustrates I(X) further.
If g_X(r) exists in an open region of r around 0 (i.e., if r₋ < 0 < r₊), then derivatives²⁹ of all orders exist in that region. They are given by

    d^k g_X(r)/dr^k = ∫_{−∞}^∞ x^k e^{rx} dF_X(x);     [d^k g_X(r)/dr^k]_{r=0} = E[X^k].   (1.54)
This shows that ﬁnding the moment generating function often provides a convenient way
to calculate the moments of a random variable (see Exercise 3.2 for an example). If any
²⁹ This result depends on interchanging the order of differentiation (with respect to r) and integration (with respect to x). This can be shown to be permissible because g_X(r) exists for r both greater and smaller than 0, which in turn implies, first, that 1 − F_X(x) must approach 0 at least exponentially as x → ∞ and, second, that F_X(x) must approach 0 at least exponentially as x → −∞.
moment of a rv fails to exist, however, then the MGF must also fail to exist over any open
interval containing 0 (see Exercise 1.39).
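As an illustration of (1.54) (a sketch, not from the text), the unit exponential rv has g_X(r) = 1/(1−r) for r < 1, whose k-th derivative at r = 0 is k!; numerical integration of E[X^k] = ∫_0^∞ x^k e^{−x} dx reproduces these moments:

```python
import math

def moment_by_integration(k, upper=60.0, steps=200000):
    # midpoint-rule approximation of E[X^k] = integral of x^k e^(-x) over [0, upper]
    h = upper / steps
    return sum(((i + 0.5) * h) ** k * math.exp(-(i + 0.5) * h) * h
               for i in range(steps))

# g(r) = 1/(1-r) has k-th derivative k!/(1-r)^(k+1), equal to k! at r = 0,
# so (1.54) predicts E[X^k] = k! for the unit exponential rv
for k in range(1, 5):
    assert abs(moment_by_integration(k) - math.factorial(k)) < 1e-3
```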
Another important feature of moment generating functions is their usefulness in treating sums of independent rv's. For example, let S_n = X_1 + X_2 + ··· + X_n. Then

    g_{S_n}(r) = E[e^{rS_n}] = E[exp(Σ_{i=1}^n rX_i)] = E[Π_{i=1}^n exp(rX_i)] = Π_{i=1}^n g_{X_i}(r).   (1.55)

In the last step, we have used a result of Exercise 1.14, which shows that for independent rv's, the mean of the product is equal to the product of the means. If X_1, ..., X_n are also IID, then

    g_{S_n}(r) = [g_X(r)]^n.   (1.56)
We will use this property frequently in treating sums of IID rv's. Note that this also implies that the regions over which the MGF's of S_n and X exist are the same, i.e., I(S_n) = I(X).
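As an example (a sketch, not from the text), a Bernoulli rv with p_Z(1) = p has g_X(r) = q + p e^r, and (1.56) then gives the MGF of the binomial sum S_n; this product form agrees with the direct computation of E[e^{rS_n}] from the binomial PMF:

```python
import math

p, q, n, r = 0.3, 0.7, 12, 0.4

g_X = q + p * math.exp(r)          # MGF of a single Bernoulli rv
g_Sn_product = g_X**n              # (1.56): MGF of the IID sum S_n

# direct computation of E[e^(r S_n)] from the binomial PMF of S_n
g_Sn_direct = sum(math.comb(n, k) * p**k * q ** (n - k) * math.exp(r * k)
                  for k in range(n + 1))

assert abs(g_Sn_product - g_Sn_direct) < 1e-9 * g_Sn_direct
```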
The real variable r in the MGF can be replaced by a complex variable, giving rise to a number of other transforms. A particularly important case is to view r as a pure imaginary variable, say iθ where i = √−1 and θ is real. Then³⁰ g_X(iθ) = E[e^{iθX}] is called the characteristic function of X. Since |e^{iθx}| is 1 for all x, g_X(iθ) exists for all rv's X and all real θ, and its magnitude is at most one.
A minor but important variation on the characteristic function of X is the Fourier transform of the probability density of X. If X has a density f_X(x), then the Fourier transform of f_X(x) is given by

    g_X(i2πθ) = ∫_{−∞}^∞ f_X(x) exp(i2πθx) dx.   (1.57)

The major advantage of the Fourier transform (aside from its familiarity) is that f_X(x) can usually be found from g_X(i2πθ) as the inverse Fourier transform,³¹

    f_X(x) = ∫_{−∞}^∞ g_X(i2πθ) exp(−i2πθx) dθ.   (1.58)
The Z-transform is the result of replacing e^r with z in g_X(r). This is useful primarily for integer-valued rv's, but if one transform can be evaluated, the other can be found immediately. Finally, if we use −s, viewed as a complex variable, in place of r, we get the two-sided Laplace transform of the density of the random variable. Note that for all of these transforms, multiplication in the transform domain corresponds to convolution of the distribution functions or densities, i.e., to summation of independent rv's. The simplicity of taking products of transforms is a major reason that transforms are so useful in probability theory.
³⁰ The notation here can be slightly dangerous, since one cannot necessarily take an expression for g_X(r), valid for real r, and replace r by iθ with real θ to get the characteristic function.

³¹ This integral does not necessarily converge, particularly if X does not have a PDF. However, it can be shown (see [22] Chap. 2.12, or [8], Chap. 15) that the characteristic function/Fourier transform of an arbitrary rv does uniquely specify the distribution function.
1.4 Basic inequalities
Inequalities play a particularly fundamental role in probability, partly because many important models are too complex to find exact answers, and partly because many of the most useful theorems establish limiting rather than exact results. In this section, we study three related inequalities: the Markov, Chebyshev, and Chernoff bounds. These are used repeatedly both in the next section and in the remainder of the text.
1.4.1 The Markov inequality
This is the simplest and most basic of these inequalities. It states that if a nonnegative random variable Y has a mean E[Y], then, for every y > 0, Pr{Y ≥ y} satisfies³²

    Pr{Y ≥ y} ≤ E[Y]/y    for every y > 0 (Markov inequality).    (1.59)

Figure 1.7 derives this result using the fact (see Figure 1.3) that the mean of a nonnegative rv is the integral of its complementary distribution function, i.e., of the area under the curve Pr{Y > z}. Exercise 1.31 gives another simple proof using an indicator random variable.
[Figure 1.7: Demonstration that y·Pr{Y ≥ y} ≤ E[Y]: the rectangle of area y·Pr{Y ≥ y} lies under the curve Pr{Y > z}, whose total area is E[Y]. By letting y → ∞, it can also be seen that the shaded area becomes a negligible portion of the area E[Y], so that lim_{y→∞} y·Pr{Y > y} = 0 if E[Y] < ∞.]
As an example of this inequality, assume that the average height of a population of people is 1.6 meters. Then the Markov inequality states that at most half of the population have a height exceeding 3.2 meters. We see from this example that the Markov inequality is often very weak. However, for any y > 0, we can consider a rv that takes on the value y with probability ε and the value 0 with probability 1 − ε; this rv satisfies the Markov inequality at the point y with equality. Figure 1.7 (as elaborated in Exercise 1.47) also shows that, for any nonnegative rv Y with a finite mean,

    lim_{y→∞} y·Pr{Y ≥ y} = 0.    (1.60)
³² The distribution function of any given rv Y is known (at least in principle), and thus one might question why an upper bound is ever preferable to the exact value. One answer is that Y might be given as a function of many other rv's and that the parameters (such as the mean) used in a bound are often much easier to find than the distribution function. Another answer is that such inequalities are often used in theorems which state results in terms of simple statistics such as the mean rather than the entire distribution function. This will be evident as we use these bounds.
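A quick numerical check, as an aside: the Exponential(1) rv is an arbitrary example whose exact tail e^{−y} is known, so the looseness of the Markov bound E[Y]/y can be measured directly.

```python
import math

def markov_bound(mean, y):
    """Markov: Pr{Y >= y} <= E[Y]/y for a nonnegative rv Y and y > 0."""
    return mean / y

# Y ~ Exponential(1): E[Y] = 1 and Pr{Y >= y} = exp(-y) exactly.
for y in (0.5, 1.0, 2.0, 5.0, 10.0):
    exact = math.exp(-y)
    assert exact <= markov_bound(1.0, y)   # the bound always holds...
# ...and y * Pr{Y >= y} -> 0 as y -> infinity, consistent with (1.60):
assert 10.0 * math.exp(-10.0) < 1e-3
```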
1.4. BASIC INEQUALITIES 33
This will be useful shortly in the proof of Theorem 1.5.4.
1.4.2 The Chebyshev inequality
We now use the Markov inequality to establish the well-known Chebyshev inequality. Let Z be an arbitrary rv with finite mean E[Z] and finite variance σ²_Z, and define Y as the nonnegative rv Y = (Z − E[Z])². Thus E[Y] = σ²_Z. Applying (1.59),

    Pr{(Z − E[Z])² ≥ y} ≤ σ²_Z / y    for every y > 0.

Replacing y with ε² and noting that the event {(Z − E[Z])² ≥ ε²} is the same as {|Z − E[Z]| ≥ ε}, this becomes

    Pr{|Z − E[Z]| ≥ ε} ≤ σ²_Z / ε²    for every ε > 0 (Chebyshev inequality).    (1.61)
Note that the Markov inequality bounds just the upper tail of the distribution function and applies only to nonnegative rv's, whereas the Chebyshev inequality bounds both tails of the distribution function. The more important differences, however, are that the Chebyshev bound requires a finite variance and goes to zero inversely with the square of the distance from the mean, whereas the Markov bound does not require a finite variance and goes to zero inversely with the distance from 0 (and thus asymptotically with distance from the mean).

The Chebyshev inequality is particularly useful when Z is the sample average, (X_1 + X_2 + ··· + X_n)/n, of a set of IID rv's. This will be used shortly in proving the weak law of large numbers.
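As with the Markov bound, the looseness can be seen concretely. A small sketch; Uniform(0,1) is an arbitrary example whose exact two-sided tail is known.

```python
def chebyshev_bound(var, eps):
    """Chebyshev: Pr{|Z - E[Z]| >= eps} <= var / eps**2."""
    return var / eps ** 2

# Z ~ Uniform(0,1): E[Z] = 1/2, var = 1/12, and the exact two-sided tail
# Pr{|Z - 1/2| >= eps} equals 1 - 2*eps for 0 < eps < 1/2.
for eps in (0.1, 0.2, 0.3, 0.4):
    exact = 1 - 2 * eps
    assert exact <= chebyshev_bound(1 / 12, eps)
```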
1.4.3 Chernoff bounds

Chernoff (or exponential) bounds are another variation of the Markov inequality in which the bound on each tail of the distribution function goes to 0 exponentially with distance from the mean. For any given rv Z, let I(Z) be the interval over which the MGF g_Z(r) = E[exp(Zr)] exists. Letting Y = exp(Zr) for any r ∈ I(Z), the Markov inequality (1.59) applied to Y is

    Pr{exp(rZ) ≥ y} ≤ g_Z(r)/y    for every y > 0 (Chernoff bound).

This takes on a more meaningful form if y is replaced by exp(rb). Note that exp(rZ) ≥ exp(rb) is equivalent to Z ≥ b for r > 0 and to Z ≤ b for r < 0. Thus, for any real b, we get the following two bounds, one for r > 0 and the other for r < 0:

    Pr{Z ≥ b} ≤ g_Z(r) exp(−rb) ;    (Chernoff bound for r > 0, r ∈ I(Z))    (1.62)
    Pr{Z ≤ b} ≤ g_Z(r) exp(−rb) ;    (Chernoff bound for r < 0, r ∈ I(Z)).    (1.63)
This provides us with a family of upper bounds, one for each r ∈ I(Z), on the tails of the distribution function. The upper tail is bounded by values of r > 0 and the lower tail by values of r < 0. For given r > 0, this bound on Pr{Z ≥ b} decreases exponentially³³ in b at rate r. Similarly, for given r < 0, the bound on Pr{Z ≤ b} decreases exponentially at rate |r| as b → −∞. We will see shortly that (1.62) is useful only when b > E[Z] and (1.63) is useful only when b < E[Z].
The most important application of Chernoff bounds is to sums of IID rv's. Let S_n = X_1 + ··· + X_n where X_1, ..., X_n are IID with the MGF g_X(r). Then g_{Sn}(r) = [g_X(r)]^n, so (1.62) and (1.63) (with b replaced by na) become

    Pr{S_n ≥ na} ≤ [g_X(r)]^n exp(−rna) ;    (for 0 < r ∈ I(X))    (1.64)
    Pr{S_n ≤ na} ≤ [g_X(r)]^n exp(−rna) ;    (for 0 > r ∈ I(X)).    (1.65)

These equations are easier to understand if we define the semi-invariant MGF, γ_X(r), as

    γ_X(r) = ln g_X(r).    (1.66)
The semi-invariant MGF for a typical rv X is sketched in Figure 1.8. The major features to observe are, first, that γ′_X(0) = E[X] and, second, that γ″_X(r) ≥ 0 for r in the interior of I(X).
[Figure 1.8: Semi-invariant moment-generating function γ(r) for a typical rv X, assuming r_− < 0 < r_+. The sketch shows γ(0) = 0, γ′(0) = E[X], γ″(0) = σ²_X, and γ″(r) > 0, with the straight line of slope E[X] through the origin tangent to γ(r). Since γ(r) = ln g(r), we see that dγ(r)/dr = (1/g(r)) dg(r)/dr. Thus γ′(0) = E[X]. Also, for r in the interior of I(X), Exercise 1.27 shows that γ″(r) ≥ 0, and in fact γ″(r) is strictly positive except in the uninteresting case where X is deterministic (takes on a single value with probability 1).]
In terms of γ_X(r), (1.64) and (1.65) become

    Pr{S_n ≥ na} ≤ exp(n[γ_X(r) − ra]) ;    (for 0 < r ∈ I(X))    (1.67)
    Pr{S_n ≤ na} ≤ exp(n[γ_X(r) − ra]) ;    (for 0 > r ∈ I(X)).    (1.68)

These bounds are geometric in n for fixed a and r, so for any given a, we can optimize the bound simultaneously for all n by minimizing γ_X(r) − ra. Since γ″_X(r) > 0, the tightest bound arises either at that r for which γ′(r) = a or at one of the end points, r_− or r_+. This minimum value is denoted³⁴

    μ_X(a) = inf_{r ∈ I(X)} [γ_X(r) − ra].
³³ This seems paradoxical, since Z seems to be almost arbitrary. However, since r ∈ I(Z), we have ∫ e^{rb} dF_Z(b) < ∞.

³⁴ The infimum, denoted inf, of a set of numbers is the largest number less than or equal to all numbers in the set. For example, inf{(0, 1)} = 0, whereas min{(0, 1)} does not exist.
Note that (γ_X(r) − ra)|_{r=0} = 0 and (d/dr)(γ_X(r) − ra)|_{r=0} = E[X] − a. Thus if a > E[X], then γ_X(r) − ra must be negative for sufficiently small positive r. Similarly, if a < E[X], then γ_X(r) − ra is negative for negative r sufficiently close³⁵ to 0. In other words,

    Pr{S_n ≥ na} ≤ exp(n μ_X(a)) ;    where μ_X(a) < 0 for a > E[X]    (1.69)
    Pr{S_n ≤ na} ≤ exp(n μ_X(a)) ;    where μ_X(a) < 0 for a < E[X].    (1.70)
This is summarized in the following lemma:

Lemma 1.4.1. Assume that r_− < 0 < r_+ and let S_n be the sum of n IID rv's each with the distribution of X. Then μ_X(a) < 0 for all a ≠ E[X]. Also, Pr{S_n ≥ na} ≤ e^{nμ_X(a)} for a > E[X] and Pr{S_n ≤ na} ≤ e^{nμ_X(a)} for a < E[X].
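The optimization in the lemma is easy to carry out numerically, since γ_X(r) − ra is convex in r. The sketch below finds μ_X(a) by ternary search; the search bracket [−20, 20] and the Bernoulli(0.3) example are arbitrary choices, and the closed form checked against is the one derived in Example 1.4.1 below.

```python
import math

def mu(gamma, a, lo=-20.0, hi=20.0, iters=200):
    """inf over r of gamma(r) - r*a, by ternary search on a convex function."""
    f = lambda r: gamma(r) - r * a
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            hi = m2
        else:
            lo = m1
    return f((lo + hi) / 2)

p, q = 0.3, 0.7
gamma_bern = lambda r: math.log(q + p * math.exp(r))   # semi-invariant MGF of Bernoulli(p)
for a in (0.4, 0.5, 0.6):                              # all differ from E[X] = 0.3
    val = mu(gamma_bern, a)
    assert val < 0                                     # mu_X(a) < 0 for a != E[X]
    closed = a * math.log(p / a) + (1 - a) * math.log(q / (1 - a))
    assert abs(val - closed) < 1e-9
```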
Figure 1.9 illustrates the lemma and gives a graphical construction to find³⁶ μ_X(a).

[Figure 1.9: Graphical minimization of γ(r) − ar: For any r ∈ I(X), γ(r) − ar is the vertical-axis intercept of a line of slope a through the point (r, γ(r)). The minimum occurs when the line of slope a is tangent to the curve, say at r_o where γ′(r_o) = a, with intercept γ(r_o) − r_o a. The two examples show one case where E[X] < 0 and another where E[X] > 0.]
These Chernoff bounds will be used in the next section to help understand several laws of large numbers. They will also be used extensively in Chapter 9 and are useful for detection, random walks, and information theory.
The following example evaluates these bounds for the case where the IID rv’s are binary.
We will see that in this case the bounds are exponentially tight in a sense to be described.
Example 1.4.1. Let X be binary with p_X(1) = p and p_X(0) = q = 1 − p. Then g_X(r) = q + p e^r for −∞ < r < ∞. Also, γ_X(r) = ln(q + p e^r). To be consistent with the expression for the binomial PMF in (1.24), we will find bounds to Pr{S_n ≥ p̃n} and Pr{S_n ≤ p̃n} for p̃ > p and p̃ < p respectively. Thus, according to Lemma 1.4.1, we first evaluate

    μ_X(p̃) = inf_r [γ_X(r) − p̃r].
³⁵ In fact, for r sufficiently small, γ(r) can be approximated by a second-order power series, γ(r) ≈ γ(0) + rγ′(0) + (r²/2)γ″(0) = rX̄ + (r²/2)σ²_X. It follows that μ_X(a) ≈ −(a − X̄)²/2σ²_X for very small r.

³⁶ As a special case, the infimum might occur at the edge of the interval of convergence, i.e., at r_+ or r_−. As shown in Exercise 1.26, the infimum can be at r_+ (r_−) only if g_X(r_+) (g_X(r_−)) exists, and in this case the graphical technique in Figure 1.9 still works.
The minimum occurs at that r for which γ′_X(r) = p̃, i.e., at

    p e^r / (q + p e^r) = p̃.

Rearranging terms,

    e^r = p̃q / (p q̃)    where q̃ = 1 − p̃.    (1.71)

Substituting this minimizing value of r into ln(q + p e^r) − rp̃ and rearranging terms,

    μ_X(p̃) = p̃ ln(p/p̃) + q̃ ln(q/q̃).    (1.72)
Substituting this into (1.69) and (1.70), we get the following Chernoff bounds for binary IID rv's. As shown above, they are exponentially decreasing in n.

    Pr{S_n ≥ np̃} ≤ exp{n[p̃ ln(p/p̃) + q̃ ln(q/q̃)]} ;    for p̃ > p    (1.73)
    Pr{S_n ≤ np̃} ≤ exp{n[p̃ ln(p/p̃) + q̃ ln(q/q̃)]} ;    for p̃ < p.    (1.74)
So far, it seems that we have simply developed another upper bound on the tails of the distribution function for the binomial. It will then perhaps be surprising to compare this bound with the asymptotically correct value (repeated below) for the binomial PMF in (1.27) with p̃ = k/n:

    p_{Sn}(k) ~ (1/√(2πn p̃q̃)) exp{−n c(p, p̃)}    where c(p, p̃) = −[p̃ ln(p/p̃) + q̃ ln(q/q̃)].    (1.75)
For any integer value of np̃ with p̃ > p, we can lower bound Pr{S_n ≥ np̃} by the single term p_{Sn}(np̃). Thus Pr{S_n ≥ np̃} is both upper and lower bounded by quantities that decrease exponentially with n at the same rate. The difference between the upper bound and the asymptotic lower bound is essentially negligible for large n. We can express this analytically by considering the log of the upper bound in (1.73) and the lower bound in (1.75):

    lim_{n→∞} (1/n) ln Pr{S_n ≥ np̃} = −c(p, p̃)    where p̃ > p.    (1.76)

In the same way, for p̃ < p,

    lim_{n→∞} (1/n) ln Pr{S_n ≤ np̃} = −c(p, p̃)    where p̃ < p.    (1.77)

In other words, these Chernoff bounds are not only upper bounds, but are also exponentially tight in the sense of (1.76) and (1.77). In Chapter 9 we will show that this property is typical for sums of IID rv's. Thus we see that the Chernoff bounds are not 'just bounds,' but rather are bounds that when optimized provide the correct asymptotic exponent for the tails of the distribution of sums of IID rv's. In this sense these bounds are quite different from the Markov and Chebyshev bounds.
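The claimed exponential tightness can be probed numerically. A sketch; n, p, and p̃ below are arbitrary choices, and the exact tail is summed directly from the binomial PMF.

```python
import math

def exact_upper_tail(n, p, k):
    """Pr{S_n >= k} for S_n ~ Binomial(n, p), summed exactly."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

def chernoff_exponent(p, pt):
    """mu_X(ptilde) of (1.72) for binary X."""
    q, qt = 1 - p, 1 - pt
    return pt * math.log(p / pt) + qt * math.log(q / qt)

n, p, pt = 100, 0.3, 0.5
bound = math.exp(n * chernoff_exponent(p, pt))     # the bound (1.73)
exact = exact_upper_tail(n, p, int(n * pt))
assert exact <= bound          # a true upper bound...
assert exact > bound / n       # ...loose by at most a polynomial factor here
```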
1.5 The laws of large numbers
The laws of large numbers are a collection of results in probability theory that describe the behavior of the arithmetic average of n rv's for large n. For any n rv's, X_1, ..., X_n, the arithmetic average is the rv (1/n) Σ_{i=1}^{n} X_i. Since in any outcome of the experiment, the sample value of this rv is the arithmetic average of the sample values of X_1, ..., X_n, this random variable is usually called the sample average. If X_1, ..., X_n are viewed as successive variables in time, this sample average is called the time average. Under fairly general assumptions, the standard deviation of the sample average goes to 0 with increasing n, and, in various ways depending on the assumptions, the sample average approaches the mean.
These results are central to the study of stochastic processes because they allow us to relate time averages (i.e., the average over time of individual sample paths) to ensemble averages (i.e., the mean of the value of the process at a given time). In this section, we develop and discuss one of these results, the weak law of large numbers for IID rv's. We also briefly discuss another of these results, the strong law of large numbers. The strong law requires considerable patience to understand, and its derivation and fuller discussion are postponed to Chapter 5 where it is first needed. We also discuss the central limit theorem, partly because it enhances our understanding of the weak law, and partly because of its importance in its own right.
1.5.1 Weak law of large numbers with a finite variance
Let X_1, X_2, ..., X_n be IID rv's with a finite mean X̄ and finite variance σ²_X. Let S_n = X_1 + ··· + X_n, and consider the sample average S_n/n. We saw in (1.44) that σ²_{Sn} = nσ²_X. Thus the variance of S_n/n is

    VAR[S_n/n] = E[((S_n − nX̄)/n)²] = (1/n²) E[(S_n − nX̄)²] = σ²_X / n.    (1.78)
This says that the standard deviation of the sample average S_n/n is σ_X/√n, which approaches 0 as n increases. Figure 1.10 illustrates this decrease in the standard deviation of S_n/n with increasing n. In contrast, recall that Figure 1.5 illustrated how the standard deviation of S_n increases with n. From (1.78), we see that

    lim_{n→∞} E[(S_n/n − X̄)²] = 0.    (1.79)
As a result, we say that S_n/n converges in mean square to X̄.
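Equation (1.78) is easy to confirm by simulation. A Monte Carlo sketch; the Uniform(0,1) summands (σ²_X = 1/12), the seed, and the trial count are arbitrary choices.

```python
import random

random.seed(1)

def sample_average(n):
    return sum(random.random() for _ in range(n)) / n

def empirical_var(n, trials=20000):
    vals = [sample_average(n) for _ in range(trials)]
    m = sum(vals) / trials
    return sum((v - m) ** 2 for v in vals) / trials

# VAR[S_n/n] should be (1/12)/n for Uniform(0,1) summands, per (1.78).
for n in (1, 4, 16):
    assert abs(empirical_var(n) - (1 / 12) / n) < 0.01
```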
This convergence in mean square says that the sample average, S_n/n, differs from the mean, X̄, by a random variable whose standard deviation approaches 0 with increasing n. This convergence in mean square is one sense in which S_n/n approaches X̄, but the idea of a sequence of rv's (i.e., a sequence of functions on the sample space) approaching a constant is clearly much more involved than a sequence of numbers approaching a constant. The laws of large numbers
[Figure 1.10: The same distribution as Figure 1.5, scaled differently to give the distribution function of the sample average Z_n = S_n/n for n = 4, 20, and 50. It can be visualized that as n increases, the distribution function of Z_n becomes increasingly close to a unit step at the mean, 0.25, of the variables X being summed.]
bring out this central idea in a more fundamental, and usually more useful, way. We start the development by applying the Chebyshev inequality (1.61) to the sample average:

    Pr{|S_n/n − X̄| > ε} ≤ σ² / (nε²).    (1.80)

This is an upper bound on the probability that S_n/n differs by more than ε from its mean, X̄. This is illustrated in Figure 1.10, which shows the distribution function of S_n/n for various n. The figure suggests that lim_{n→∞} F_{Sn/n}(z) = 0 for all z < X̄ and lim_{n→∞} F_{Sn/n}(z) = 1 for all z > X̄. This is stated more cleanly in the following weak law of large numbers, abbreviated WLLN.
Theorem 1.5.1 (WLLN with finite variance). For each integer n ≥ 1, let S_n = X_1 + ··· + X_n be the sum of n IID rv's with a finite variance. Then

    lim_{n→∞} Pr{|S_n/n − X̄| > ε} = 0    for every ε > 0.    (1.81)

Proof: For every ε > 0, Pr{|S_n/n − X̄| > ε} is bounded between 0 and σ²/nε². Since the upper bound goes to 0 with increasing n, the theorem is proved.
Discussion: The algebraic proof above is both simple and rigorous. However, the graphical
description in Figure 1.11 probably provides more intuition about how the limit takes place.
It is important to understand both.
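The theorem can also be watched happening in simulation. A Monte Carlo sketch for Bernoulli(0.25) summands (the seed, ε, and trial count are arbitrary choices): the empirical probability of missing the mean by more than ε stays below the Chebyshev bound of (1.80) and shrinks toward 0 with n, as (1.81) asserts.

```python
import random

random.seed(2)
p, var, eps, trials = 0.25, 0.25 * 0.75, 0.1, 2000

def miss_prob(n):
    """Empirical Pr{|S_n/n - p| > eps} over `trials` independent runs."""
    miss = 0
    for _ in range(trials):
        avg = sum(random.random() < p for _ in range(n)) / n
        if abs(avg - p) > eps:
            miss += 1
    return miss / trials

probs = {n: miss_prob(n) for n in (25, 100, 400)}
for n, pr in probs.items():
    assert pr <= var / (n * eps * eps) + 0.03   # Chebyshev bound plus Monte Carlo slack
assert probs[25] >= probs[100] >= probs[400]     # shrinking toward 0
```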
We refer to (1.81) as saying that S_n/n converges to X̄ in probability. To make sense out of this, we should view X̄ as a deterministic random variable, i.e., a rv that takes the value X̄
[Figure 1.11: Approximation of the distribution function F_{Sn/n} of a sample average by a step function at the mean: From (1.80), the probability c that S_n/n differs from X̄ by more than ε (i.e., Pr{|S_n/n − X̄| ≥ ε}) is at most σ²/nε². The complementary event, where |S_n/n − X̄| < ε, has probability 1 − c ≥ 1 − σ²/nε². This means that we can construct a rectangle of width 2ε centered on X̄ and of height 1 − c such that F_{Sn/n} enters the rectangle at the lower left (say at (X̄ − ε, c_1)) and exits at the upper right (say at (X̄ + ε, 1 − c_2)), where c_1 + c_2 = c. Now visualize increasing n while holding ε fixed. In the limit, 1 − c → 1, so Pr{|S_n/n − X̄| ≥ ε} → 0. Since this is true for every ε > 0 (usually with slower convergence as ε gets smaller), F_{Sn/n}(z) approaches 0 for every z < X̄ and approaches 1 for every z > X̄, i.e., F_{Sn/n} approaches a unit step at X̄. Note that there are two 'fudge factors' here, ε and c, and since we are approximating an entire distribution function, neither can be omitted, except by directly going to a limit as n → ∞.]
for each sample point of the space. Then (1.81) says that the probability that the absolute difference, |S_n/n − X̄|, exceeds any given ε > 0 goes to 0 as n → ∞.³⁷
One should ask at this point what (1.81) adds to the more specific bound in (1.80). In particular, (1.80) provides an upper bound on the rate of convergence for the limit in (1.81). The answer is that (1.81) remains valid when the theorem is generalized. For variables that are not IID or have an infinite variance, (1.80) is no longer necessarily valid. In some situations, as we see later, it is valuable to know that (1.81) holds, even if the rate of convergence is extremely slow or unknown.
One difficulty with the bound in (1.80) is that it is extremely loose in most cases. If S_n/n actually approached X̄ this slowly, the weak law of large numbers would often be more a mathematical curiosity than a highly useful result. If we assume that the MGF of X exists in an open interval around 0, then (1.80) can be strengthened considerably. Recall from (1.69) and (1.70) that for any ε > 0,

    Pr{S_n/n − X̄ ≥ ε} ≤ exp(n μ_X(X̄ + ε))    (1.82)
    Pr{S_n/n − X̄ ≤ −ε} ≤ exp(n μ_X(X̄ − ε)),    (1.83)

where from Lemma 1.4.1, μ_X(a) = inf_r {γ_X(r) − ra} < 0 for a ≠ X̄. Thus, for any ε > 0,

    Pr{|S_n/n − X̄| ≥ ε} ≤ exp[n μ_X(X̄ + ε)] + exp[n μ_X(X̄ − ε)].    (1.84)
³⁷ Saying this in words gives one added respect for mathematical notation, and perhaps in this case, it is preferable to simply understand the mathematical statement (1.81).
The bound here, for any fixed ε > 0, decreases geometrically in n rather than harmonically. In terms of Figure 1.11, the height of the rectangle must approach 1 at least geometrically in n.
1.5.2 Relative frequency

We next show that (1.81) (and similarly (1.84)) can be applied to the relative frequency of an event as well as to the sample average of a random variable. Suppose that A is some event in a single experiment, and that the experiment is independently repeated n times. Then, in the probability model for the n repetitions, let A_i be the event that A occurs at the ith trial, 1 ≤ i ≤ n. The events A_1, A_2, ..., A_n are then IID.

If we let I_{A_i} be the indicator rv for A on the ith trial, then the rv S_n = I_{A_1} + I_{A_2} + ··· + I_{A_n} is the number of occurrences of A over the n trials. It follows that

    relative frequency of A = S_n/n = (1/n) Σ_{i=1}^{n} I_{A_i}.    (1.85)
Thus the relative frequency of A is the sample average of the binary rv's I_{A_i}, and everything we know about sums of IID rv's applies equally to the relative frequency of an event. In fact, everything we know about sums of IID binary rv's applies to relative frequency.
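As a concrete instance (the fair-die event, the seed, and the sample size are arbitrary choices), the relative frequency of rolling a six is the sample average of indicator rv's and converges to Pr{A} = 1/6:

```python
import random

random.seed(3)
n = 60000
# Indicator rv's I_{A_i} for A = {die shows a six} on each of n trials.
indicators = [1 if random.randint(1, 6) == 6 else 0 for _ in range(n)]
rel_freq = sum(indicators) / n            # S_n / n, as in (1.85)
assert abs(rel_freq - 1 / 6) < 0.01       # close to Pr{A} for large n
```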
1.5.3 The central limit theorem

The weak law of large numbers says that with high probability, S_n/n is close to X̄ for large n, but it establishes this via an upper bound on the tail probabilities rather than an estimate of what F_{Sn/n} looks like. If we look at the shape of F_{Sn/n} for various values of n in the example of Figure 1.10, we see that the function F_{Sn/n} becomes increasingly compressed around X̄ as n increases (in fact, this is the essence of what the weak law is saying). If we normalize the random variable S_n/n to 0 mean and unit variance, we get a normalized rv, Z_n = (S_n/n − X̄)√n/σ. The distribution function of Z_n is illustrated in Figure 1.12 for the same underlying X as used for S_n/n in Figure 1.10. The curves in the two figures are the same except that each curve has been horizontally scaled by √n in Figure 1.12.
Inspection of Figure 1.12 shows that the normalized distribution functions there seem to be
approaching a limiting distribution. The critically important central limit theorem states
that there is indeed such a limit, and it is the normalized Gaussian distribution function.
Theorem 1.5.2 (Central limit theorem (CLT)). Let X_1, X_2, ... be IID rv's with finite mean X̄ and finite variance σ². Then for every real number z,

    lim_{n→∞} Pr{(S_n − nX̄)/(σ√n) ≤ z} = Φ(z),    (1.86)

where Φ(z) is the normal distribution function, i.e., the Gaussian distribution with mean 0 and variance 1,

    Φ(z) = ∫_{−∞}^{z} (1/√(2π)) exp(−y²/2) dy.
[Figure 1.12: The distribution function F_{Zn}(z) of the normalized sum Z_n = (S_n − nX̄)/(σ√n) for n = 4, 20, and 50. Note that as n increases, the distribution function of Z_n slowly starts to resemble the normal distribution function.]
Discussion: The rv Z_n = (S_n − nX̄)/(σ√n), for each n ≥ 1 on the left side of (1.86), has mean 0 and variance 1. The central limit theorem (CLT), as expressed in (1.86), says that the sequence of distribution functions, F_{Z1}(z), F_{Z2}(z), ..., converges at each value of z to Φ(z) as n → ∞. In other words, lim_{n→∞} F_{Zn}(z) = Φ(z) for each z ∈ R. This is called convergence in distribution, since it is the sequence of distribution functions, rather than the sequence of rv's, that is converging. The theorem is illustrated by Figure 1.12.
The CLT tells us quite a bit about how F_{Sn/n} converges to a step function at X̄. To see this, rewrite (1.86) in the form

    lim_{n→∞} Pr{S_n/n − X̄ ≤ σz/√n} = Φ(z).    (1.87)

This is illustrated in Figure 1.13 where we have used Φ(z) as an approximation for the probability on the left.
The reason why the word central appears in the CLT can also be seen from (1.87). Asymptotically, we are looking at a limit (as n → ∞) of the probability that the sample average differs from the mean by at most a quantity going to 0 as 1/√n. This should be contrasted with the corresponding optimized Chernoff bound in (1.84), which looks at the limit of the probability that the sample average differs from the mean by at most a constant amount. Those latter results are exponentially decreasing in n and are known as large deviation results.
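Since the binomial distribution function can be computed exactly, (1.86) can be checked without simulation. A sketch (n = 400 and p = 0.25 are arbitrary choices), with Φ expressed through the error function:

```python
import math

def Phi(z):
    """Normal distribution function, via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def F_Zn(n, p, z):
    """Pr{(S_n - np)/sqrt(npq) <= z} for S_n ~ Binomial(n, p), exactly."""
    q = 1 - p
    cutoff = math.floor(n * p + z * math.sqrt(n * p * q))
    return sum(math.comb(n, k) * p**k * q**(n - k) for k in range(cutoff + 1))

n, p = 400, 0.25
for z in (-1.0, 0.0, 1.0):
    assert abs(F_Zn(n, p, z) - Phi(z)) < 0.05
```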
Theorem 1.5.2 says nothing about the rate of convergence to the normal distribution. The
Berry-Esseen theorem (see, for example, Feller, [8]) provides some guidance about this for
[Figure 1.13: Approximation of the distribution function F_{Sn/n} of a sample average by a Gaussian distribution of the same mean and variance. Whenever n is increased by a factor of 4, the curve is horizontally scaled inward toward X̄ by a factor of 2. The CLT says both that these curves are scaled horizontally as 1/√n and also that they are better approximated by the Gaussian of the given mean and variance as n increases.]
cases in which the third central moment E[|X − X̄|³] exists. This theorem states that

    |Pr{(S_n − nX̄)/(σ√n) ≤ z} − Φ(z)| ≤ C E[|X − X̄|³] / (σ³√n),    (1.88)
where C can be upper bounded by 0.766 (later improved to 0.4784). We will come back
shortly to discuss convergence in greater detail.
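The bound (1.88) can be evaluated exactly for binomial sums, since the left side is computable term by term. A numerical sketch using the improved constant 0.4784 quoted above; n and p are arbitrary choices.

```python
import math

def Phi(z):
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

def max_gap(n, p):
    """sup over the jump points of |F_{Z_n}(z) - Phi(z)| for binomial S_n."""
    q = 1 - p
    sd = math.sqrt(n * p * q)
    cdf, worst = 0.0, 0.0
    for k in range(n + 1):
        pmf = math.comb(n, k) * p**k * q**(n - k)
        z = (k - n * p) / sd
        worst = max(worst, abs(cdf - Phi(z)), abs(cdf + pmf - Phi(z)))
        cdf += pmf
    return worst

n, p = 100, 0.3
sigma = math.sqrt(p * (1 - p))
rho = p * (1 - p) * ((1 - p) ** 2 + p ** 2)    # E[|X - Xbar|^3] for Bernoulli(p)
assert max_gap(n, p) <= 0.4784 * rho / (sigma ** 3 * math.sqrt(n))
```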
The CLT helps explain why Gaussian rv’s play such a central role in probability theory.
In fact, many of the cookbook formulas of elementary statistics are based on the tacit
assumption that the underlying variables are Gaussian, and the CLT helps explain why
these formulas often give reasonable results.
One should be careful to avoid reading more into the CLT than it says. For example, the normalized sum, (S_n − nX̄)/(σ√n), need not have a density that is approximately Gaussian. In fact, if the underlying variables are discrete, the normalized sum is discrete and has no density. The PMF of the normalized sum might have very detailed and wild fine structure; this does not disappear as n increases, but becomes "integrated out" in the distribution function.
A proof of the CLT requires mathematical tools that will not be needed subsequently.³⁸ Thus we give a proof only for the binomial case. Before doing this, however, we will show that the PMF for S_n in the binomial approaches a sampled form of the Gaussian density. This detailed form of the PMF does not follow from the CLT and is often valuable in its own right.
Theorem 1.5.3. Let {X_i; i ≥ 1} be a sequence of IID binary rv's with p = p_X(1) > 0 and q = 1 − p = p_X(0) > 0. Let S_n = X_1 + ··· + X_n for each n ≥ 1 and let α be a fixed constant

³⁸ Many elementary texts provide 'simple proofs,' using transform techniques, but, among other issues, these techniques often indicate that the normalized sum has a density that approaches the Gaussian density; this is incorrect for all discrete rv's. The simplest correct proof known by the author is given by Feller ([7] and [8]).
satisfying 1/2 < α < 2/3. Then constants C and n_o exist such that for all integers k such that |k − np| ≤ n^α,

    p_{Sn}(k) = (1/√(2πnpq)) exp(−(k − np)²/(2npq)) · (1 ± C n^{3α−2})    for n ≥ n_o,    (1.89)

where this 'equation' is to be interpreted as an upper bound when the ± sign is replaced with + and a lower bound with −.
Note that n^{3α−2} goes to 0 with increasing n for α < 2/3, so the ratio of the right and left sides of (1.89) approaches 1 at the rate n^{3α−2}. This is uniform in k within the range |k − np| ≤ n^α. We will see why this fussiness is required when we go from the PMF to the distribution function.
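The approximation (1.89) can be seen numerically by comparing the binomial PMF with the sampled Gaussian density on a log scale (logs avoid overflowing the binomial coefficient; n = 10000 and the k values near np are arbitrary choices):

```python
import math

def binom_log_pmf(n, p, k):
    """ln p_{S_n}(k), via lgamma to avoid huge binomial coefficients."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1 - p))

def gauss_log_term(n, p, k):
    """ln of the Gaussian factor in (1.89)."""
    npq = n * p * (1 - p)
    return -((k - n * p) ** 2) / (2 * npq) - 0.5 * math.log(2 * math.pi * npq)

n, p = 10000, 0.4
for k in (3950, 4000, 4050):                  # |k - np| well within n**alpha
    ratio = math.exp(binom_log_pmf(n, p, k) - gauss_log_term(n, p, k))
    assert abs(ratio - 1) < 0.01              # the ratio approaches 1
```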
Proof: Let p̃ = k/n, q̃ = 1 − p̃, and ε(k, n) = p̃ − p. We abbreviate ε(k, n) as ε where k and n are clear from the context. From the upper and lower bounds to p_{Sn}(k) in (1.24) and (1.26), we can express p_{Sn}(k) as

    p_{Sn}(k) = ((1 ± n^{−1/2}) / √(2πn p̃q̃)) exp{n[p̃ ln(p/p̃) + q̃ ln(q/q̃)]}
              = ((1 ± n^{−1/2}) / √(2πn(p + ε)(q − ε))) exp{−n(p + ε) ln(1 + ε/p) − n(q − ε) ln(1 − ε/q)}
              = ···

    ···  Σ_{k ≥ np + z′√(npq)} (1/√(2πnpq)) exp(−(k − np)²/(2npq)).

As seen from (1.93), each term in (1.94) satisfies |k − np| ≤ n^α, which justifies the bounds in the following sum. That sum can be viewed as a Riemann sum for the integral in (1.86); thus the sum approaches the integral as n → ∞. Taking the limit as n → ∞ in (1.94), the term Cn^{3α−2} approaches 0, justifying (1.92). The theorem follows by taking the limit z′ → ∞.
Since the CLT provides such explicit information about the convergence of S_n/n to X̄, it is reasonable to ask why the weak law of large numbers (WLLN) is so important. The first reason is that the WLLN is so simple that it can be used to give clear insights into situations where the CLT could confuse the issue. A second reason is that the CLT requires a variance, whereas, as we see next, the WLLN does not. A third reason is that the WLLN can be extended to many situations in which the variables are not independent and/or not identically distributed.³⁹ A final reason is that the WLLN provides an upper bound on the tails of F_{Sn/n}, whereas the CLT provides only an approximation.
1.5.4 Weak law with an infinite variance

We now establish the WLLN without assuming a finite variance.

Theorem 1.5.4 (WLLN). For each integer n ≥ 1, let S_n = X_1 + ··· + X_n where X_1, X_2, ... are IID rv's satisfying E[|X|] < ∞. Then for any ε > 0,

    lim_{n→∞} Pr{|S_n/n − E[X]| > ε} = 0.    (1.95)

³⁹ Central limit theorems also hold in many of these more general situations, but they do not hold as widely as the WLLN.
Proof:⁴⁰ We use a truncation argument; such arguments are used frequently in dealing with rv's that have infinite variance. The underlying idea in these arguments is important, but some less important details are treated in Exercise 1.43. Let b be a positive number (which we later take to be increasing with n), and for each variable X_i, define a new rv X̆_i (see Figure 1.14) by

    X̆_i = X_i          for E[X] − b ≤ X_i ≤ E[X] + b
    X̆_i = E[X] + b     for X_i > E[X] + b
    X̆_i = E[X] − b     for X_i < E[X] − b.    (1.96)
[Figure 1.14: The truncated rv X̆ for a given rv X has a distribution function which is truncated at X̄ ± b.]
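The truncation of (1.96) is just a clip of X to the interval [E[X] − b, E[X] + b]. A sketch below uses the long-tailed example X = U^{−3/4} with U ~ Uniform(0, 1], an arbitrary choice for which E[X] = 4 by direct integration; it shows the truncation bias vanishing as b grows.

```python
import random

def truncate(x, mean, b):
    """The truncated rv of (1.96): clip x to [mean - b, mean + b]."""
    return min(max(x, mean - b), mean + b)

random.seed(4)
# X = U**-0.75 with U ~ Uniform(0,1]: finite mean (= 4), very heavy tail.
samples = [(1.0 - random.random()) ** -0.75 for _ in range(100000)]

def bias(b, mean=4.0):
    """Average amount removed by truncation; it vanishes as b -> infinity."""
    return sum(x - truncate(x, mean, b) for x in samples) / len(samples)

assert bias(10.0) > bias(100.0) > bias(1000.0) >= 0.0
```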
The truncated variables X̆_i are IID and, because of the truncation, must have a finite second moment. Thus the WLLN applies to the sample average S̆_n/n, where S̆_n = X̆_1 + ··· + X̆_n. More particularly, using the Chebyshev inequality in the form of (1.80) on S̆_n/n, we get

    Pr{|S̆_n/n − E[X̆]| > ε/2} ≤ 4σ²_{X̆} / (nε²) ≤ 8bE[|X|] / (nε²),
where Exercise 1.43 demonstrates the final inequality. Exercise 1.43 also shows that E[X̆] approaches E[X] as b → ∞ and thus that

    Pr{|S̆_n/n − E[X]| > ε} ≤ 8bE[|X|] / (nε²),    (1.97)
for all sufficiently large b. This bound also applies to S_n/n in the case where S_n = S̆_n, so we have the following bound (see Exercise 1.43 for further details):

    Pr{|S_n/n − E[X]| > ε} ≤ Pr{|S̆_n/n − E[X]| > ε} + Pr{S_n ≠ S̆_n}.    (1.98)
The original sum S_n is the same as S̆_n unless one of the X_i has an outage, i.e., |X_i − X̄| > b. Thus, using the union bound, Pr{S_n ≠ S̆_n} ≤ n·Pr{|X_i − X̄| > b}. Substituting this and (1.97) into (1.98),

    Pr{|S_n/n − E[X]| > ε} ≤ 8bE[|X|] / (nε²) + (n/b)·[b·Pr{|X − E[X]| > b}].    (1.99)
⁴⁰ The details of this proof can be omitted without loss of continuity. However, truncation arguments are important in many places and should be understood at some point.
We now show that for any ε > 0 and c > 0, Pr{|S_n/n − X̄| ≥ ε} ≤ c for all sufficiently large n. We do this, for given ε, c, by choosing b(n) for each n so that the first term in (1.99) is equal to c/2. Thus b(n) = ncε²/(16E[|X|]). This means that n/b(n) in the second term is independent of n. Now from (1.60), lim_{b→∞} b·Pr{|X − X̄| > b} = 0, so by choosing b(n) sufficiently large (and thus n sufficiently large), the second term in (1.99) is also at most c/2.
1.5.5 Convergence of random variables
This section has developed a number of results about how the sequence of sample averages, {S_n/n; n ≥ 1}, for a sequence of IID rv's {X_i; i ≥ 1}, approaches the mean X̄. In the case of the CLT, the limiting distribution around the mean is also specified to be Gaussian. At the outermost intuitive level, i.e., at the level most useful when first looking at some very complicated set of issues, viewing the limit of the sample averages as being essentially equal to the mean is highly appropriate.
At the next intuitive level down, the meaning of the word essentially becomes important and thus involves the details of the above laws. All of the results involve how the rv's S_n/n change with n and become better and better approximated by X̄. When we talk about a sequence of rv's (namely a sequence of functions on the sample space) being approximated by a rv or numerical constant, we are talking about some kind of convergence, but it clearly is not as simple as a sequence of real numbers (such as 1/n, for example) converging to some given number (0, for example).
The purpose of this section is to give names and definitions to these various forms of convergence. This will give us increased understanding of the laws of large numbers already developed, but, equally important, it will allow us to develop another law of large numbers called the strong law of large numbers (SLLN). Finally, it will put us in a position to use these convergence results later for sequences of rv's other than the sample averages of IID rv's.
We discuss four types of convergence in what follows: convergence in distribution, in probability, in mean square, and with probability 1. For the first three, we first recall the type of large-number result with that type of convergence and then give the general definition.
For convergence with probability 1 (WP1), we will deﬁne this type of convergence and then
provide some understanding of what it means. This will then be used in Chapter 5 to state
and prove the SLLN.
We start with the central limit theorem, which, from (1.86), says

lim_{n→∞} Pr{ (S_n − nX̄)/(√n σ) ≤ z } = ∫_{−∞}^{z} (1/√(2π)) exp(−x²/2) dx   for every z ∈ R.
This is illustrated in Figure 1.12 and says that the sequence (in n) of distribution functions Pr{(S_n − nX̄)/(√n σ) ≤ z} converges at every z to the normal distribution function at z. This is an example of convergence in distribution.
Definition 1.5.1. A sequence of random variables, Z_1, Z_2, . . . , converges in distribution to a random variable Z if lim_{n→∞} F_{Z_n}(z) = F_Z(z) at each z for which F_Z(z) is continuous.
For the CLT example, the rv's that converge in distribution are {(S_n − nX̄)/(√n σ); n ≥ 1}, and they converge in distribution to the normalized Gaussian rv.
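Convergence in distribution can be seen numerically. The following sketch (not from the text; the Uniform[0,1] summands, helper names, and trial counts are illustrative assumptions) estimates the distribution function of (S_n − nX̄)/(√n σ) at a few points and compares it with the Gaussian distribution function:

```python
import math
import random

def normalized_sum_cdf(n, z, trials=20_000, rng=None):
    """Empirical Pr{(S_n - n*mean)/(sqrt(n)*sigma) <= z} for IID Uniform[0,1] rv's."""
    rng = rng or random.Random(0)
    mean, sigma = 0.5, math.sqrt(1 / 12)
    count = 0
    for _ in range(trials):
        s = sum(rng.random() for _ in range(n))
        if (s - n * mean) / (math.sqrt(n) * sigma) <= z:
            count += 1
    return count / trials

def phi(z):
    """Standard normal distribution function via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

# The empirical CDF at each z should be close to Phi(z) for moderate n.
for z in (-1.0, 0.0, 1.0):
    print(z, normalized_sum_cdf(50, z), round(phi(z), 4))
```

Only the distribution functions are compared here; nothing is claimed about the rv's themselves converging, which is exactly the point of the next paragraphs.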
Convergence in distribution does not say that the rv's themselves converge in any reasonable sense, but only that their distribution functions converge. For example, let Y_1, Y_2, . . . , be IID rv's with the distribution function F_Y. For each n ≥ 1, if we let Z_n = Y_n + 1/n, then it is easy to see that {Z_n; n ≥ 1} converges in distribution to Y. However (assuming Y has variance σ²_Y and is independent of each Z_n), we see that Z_n − Y has variance 2σ²_Y. Thus Z_n does not get close to Y as n → ∞ in any reasonable sense, and Z_n − Z_m does not get small as n and m both get large.⁴¹ As an even more trivial example, the sequence {Y_n; n ≥ 1} converges in distribution to Y.
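The Z_n = Y_n + 1/n example can be checked by simulation. In the sketch below (illustrative only; the Gaussian choice for Y and the sample sizes are assumptions, not from the text), the distributions of Z_n and Y nearly agree while Z_n − Y keeps variance near 2σ²_Y:

```python
import random
import statistics

rng = random.Random(1)
sigma = 1.0  # assume Y and the IID Y_n are N(0, 1)

n, trials = 1000, 20_000
zn = [rng.gauss(0, sigma) + 1 / n for _ in range(trials)]  # samples of Z_n
y = [rng.gauss(0, sigma) for _ in range(trials)]           # independent samples of Y

# The distributions nearly agree: the means differ only by 1/n ...
print(statistics.mean(zn) - statistics.mean(y))
# ... but the rv Z_n - Y does not shrink: its variance stays near 2*sigma**2.
print(statistics.pvariance([a - b for a, b in zip(zn, y)]))
```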
For the CLT, it is the rv's (S_n − nX̄)/(√n σ) that converge in distribution to the normal. As shown in Exercise 1.46, however, the rv (S_n − nX̄)/(√n σ) − (S_{2n} − 2nX̄)/(√(2n) σ) is not close to 0 in any reasonable sense, even though the two terms have distribution functions that are very close for large n.
For the next type of convergence of rv's, the WLLN, in the form of (1.95), says that

lim_{n→∞} Pr{ |S_n/n − X̄| > ε } = 0   for every ε > 0.
This is an example of convergence in probability, as deﬁned below:
Definition 1.5.2. A sequence of random variables Z_1, Z_2, . . . , converges in probability to a rv Z if lim_{n→∞} Pr{|Z_n − Z| > ε} = 0 for every ε > 0.
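Definition 1.5.2 is easy to exercise numerically. In the sketch below (the Bernoulli source, ε, and the trial counts are illustrative assumptions, not from the text), Z_n = S_n/n and Z is the constant rv 1/2, and the deviation probability is estimated for growing n:

```python
import random

def deviation_prob(n, eps=0.1, trials=5000, seed=0):
    """Estimate Pr{|S_n/n - mean| > eps} for IID Bernoulli(1/2) rv's."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        s = sum(rng.random() < 0.5 for _ in range(n))
        if abs(s / n - 0.5) > eps:
            bad += 1
    return bad / trials

# The deviation probability shrinks toward 0 as n grows, which is
# convergence in probability of S_n/n to the constant 1/2.
for n in (10, 100, 1000):
    print(n, deviation_prob(n))
```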
For the WLLN example, Z_n in the definition is the sample average S_n/n and Z is the constant rv X̄. It is probably simpler and more intuitive in thinking about convergence of rv's to think of the sequence of rv's {Y_n = Z_n − Z; n ≥ 1} as converging to 0 in some sense.⁴² As illustrated in Figure 1.10, convergence in probability means that {Y_n; n ≥ 1} converges in distribution to a unit step function at 0.
An equivalent statement, as illustrated in Figure 1.11, is that {Y_n; n ≥ 1} converges in probability to 0 if lim_{n→∞} F_{Y_n}(y) = 0 for all y < 0 and lim_{n→∞} F_{Y_n}(y) = 1 for all y > 0. This shows that convergence in probability is a special case of convergence in distribution, since with convergence in probability, the sequence F_{Y_n} of distribution functions converges to a unit step at 0. Note that lim_{n→∞} F_{Y_n}(y) is not specified at y = 0. However, the step function is not continuous at 0, so the limit there need not be specified for convergence in distribution.
41 In fact, saying that a sequence of rv's converges in distribution is unfortunate but standard terminology. It would be just as concise, and far less confusing, to say that a sequence of distribution functions converges rather than saying that a sequence of rv's converges in distribution.
42 Definition 1.5.2 gives the impression that convergence to a rv Z is more general than convergence to a constant or convergence to 0, but converting the rv's to Y_n = Z_n − Z makes it clear that this added generality is quite superficial.
Convergence in probability says quite a bit more than convergence in distribution. As an important example of this, consider the difference Y_n − Y_m for n and m both large. If {Y_n; n ≥ 1} converges in probability to 0, then Y_n and Y_m are both close to 0 with high probability for large n and m, and thus close to each other. More precisely, lim_{m→∞, n→∞} Pr{|Y_n − Y_m| > ε} = 0 for every ε > 0. If the sequence {Y_n; n ≥ 1} merely converges in distribution to some arbitrary distribution, then, as we saw, Y_n − Y_m can be large with high probability, even when n and m are large. Another example of this is given in Exercise 1.46.
It appears paradoxical that the CLT is more explicit about the convergence of S_n/n to X̄ than the weak law, but it corresponds to a weaker type of convergence. The resolution of this paradox is that the sequence of rv's in the CLT is {(S_n − nX̄)/(√n σ); n ≥ 1}. The presence of √n in the denominator of this sequence provides much more detailed information about how S_n/n approaches X̄ with increasing n than the limiting unit step of F_{S_n/n} itself. For example, it is easy to see from the CLT that lim_{n→∞} F_{S_n/n}(X̄) = 1/2, which can't be derived directly from the weak law.
Yet another kind of convergence is convergence in mean square (MS). An example of this, for the sample average S_n/n of IID rv's with a variance, is given in (1.79), repeated below:

lim_{n→∞} E[ (S_n/n − X̄)² ] = 0.
The general deﬁnition is as follows:
Definition 1.5.3. A sequence of rv's Z_1, Z_2, . . . , converges in mean square (MS) to a rv Z if lim_{n→∞} E[(Z_n − Z)²] = 0.
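Definition 1.5.3 can likewise be illustrated by simulation (a sketch only; the Bernoulli choice and sample sizes are assumptions, not from the text). For IID rv's with variance σ², E[(S_n/n − X̄)²] equals σ²/n, which the estimate below tracks:

```python
import random

def mean_square_error(n, trials=4000, seed=0):
    """Estimate E[(S_n/n - 0.5)^2] for IID Bernoulli(1/2) rv's (sigma^2 = 1/4)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        s = sum(rng.random() < 0.5 for _ in range(n))
        total += (s / n - 0.5) ** 2
    return total / trials

# The estimate should track sigma^2/n = 1/(4n), which goes to 0:
# convergence in mean square of S_n/n to the mean.
for n in (10, 100, 1000):
    print(n, mean_square_error(n), 0.25 / n)
```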
Our derivation of the weak law of large numbers (Theorem 1.5.1) was essentially based on
the MS convergence of (1.79). Using the same approach, Exercise 1.45 shows in general that
convergence in MS implies convergence in probability. Convergence in probability does not
imply MS convergence, since as shown in Theorem 1.5.4, the weak law of large numbers
holds without the need for a variance.
Figure 1.15 illustrates the relationship between these forms of convergence, i.e., mean square
convergence implies convergence in probability, which in turn implies convergence in dis-
tribution. The ﬁgure also shows convergence with probability 1 (WP1), which is the next
form of convergence to be discussed.
1.5.6 Convergence with probability 1
Convergence with probability 1, abbreviated as convergence WP1, is often referred to as
convergence a.s. (almost surely) and convergence a.e. (almost everywhere). The strong
law of large numbers, which is discussed brieﬂy in this section and further discussed and
proven in various forms in Chapters 5 and 9, provides an extremely important example of
convergence WP1. The general deﬁnition is as follows:
Figure 1.15: Relationship between different kinds of convergence: Convergence in distribution is the most general and is implied by all the others. Convergence in probability is the next most general and is implied both by convergence with probability 1 (WP1) and by mean square (MS) convergence, neither of which implies the other.
Definition 1.5.4. Let Z_1, Z_2, . . . , be a sequence of rv's in a sample space Ω and let Z be another rv in Ω. Then {Z_n; n ≥ 1} is defined to converge to Z with probability 1 (WP1) if

Pr{ ω ∈ Ω : lim_{n→∞} Z_n(ω) = Z(ω) } = 1.   (1.100)
The condition Pr{ω ∈ Ω : lim_{n→∞} Z_n(ω) = Z(ω)} = 1 is often stated more compactly as Pr{lim_n Z_n = Z} = 1, and even more compactly as lim_n Z_n = Z WP1, but the form here is the simplest for initial understanding. As discussed in Chapter 5, the SLLN says that if X_1, X_2, . . . are IID with E[|X|] < ∞, then the sequence of sample averages, {S_n/n; n ≥ 1}, converges WP1 to X̄.
In trying to understand (1.100), note that each sample point ω of the underlying sample space Ω maps to a sample value Z_n(ω) of each rv Z_n, and thus maps to a sample path {Z_n(ω); n ≥ 1}. For any given ω, such a sample path is simply a sequence of real numbers. That sequence of real numbers might converge to Z(ω) (which is a real number for the given ω), it might converge to something else, or it might not converge at all. Thus a set of ω exists for which the corresponding sample path {Z_n(ω); n ≥ 1} converges to Z(ω), and a second set for which the sample path converges to something else or does not converge at all. Convergence WP1 of the sequence of rv's is thus defined to occur when the first set of sample paths above is an event that has probability 1.
For each ω, the sequence {Z_n(ω); n ≥ 1} is simply a sequence of real numbers, so we briefly review what the limit of such a sequence is. A sequence of real numbers b_1, b_2, . . . is said to have a limit b if, for every ε > 0, there is an integer m_ε such that |b_n − b| ≤ ε for all n ≥ m_ε. An equivalent statement is that b_1, b_2, . . . , has a limit b if, for every integer k ≥ 1, there is an integer m(k) such that |b_n − b| ≤ 1/k for all n ≥ m(k).
Figure 1.16 illustrates this definition for those, like the author, whose eyes blur on the second or third ‘there exists’, ‘such that’, etc. in a statement. As illustrated, an important aspect of convergence of a sequence {b_n; n ≥ 1} of real numbers is that b_n becomes close to b for large n and stays close for all sufficiently large values of n.
Figure 1.16: Illustration of a sequence of real numbers b_1, b_2, . . . that converges to a number b. The figure illustrates an integer m(1) such that for all n ≥ m(1), b_n lies in the interval b ± 1. Similarly, for each k ≥ 1, there is an integer m(k) such that b_n lies in b ± 1/k for all n ≥ m(k). Thus lim_{n→∞} b_n = b means that for a sequence of ever tighter constraints, the kth constraint can be met for all sufficiently large n (i.e., all n ≥ m(k)). Intuitively, convergence means that the elements b_1, b_2, . . . get close to b and stay close. The sequence of positive integers m(1), m(2), . . . is non-decreasing, but otherwise arbitrary, depending only on the sequence {b_n; n ≥ 1}. For sequences that converge very slowly, the integers m(1), m(2), . . . are simply correspondingly larger.
Figure 1.17 gives an example of a sequence of real numbers that does not converge. Intuitively, this sequence is close to 0 (and in fact identically equal to 0) for most large n, but it doesn't stay close, because of ever more rare outages.
Figure 1.17: Illustration of a non-convergent sequence of real numbers b_1, b_2, . . . . The sequence is defined by b_n = 3/4 for n = 1, 5, 25, . . . , 5^j, . . . , for all integer j ≥ 0. For all other n, b_n = 0. The terms for which b_n ≠ 0 become increasingly rare as n → ∞. Note that b_n ∈ [−1, 1] for all n, but there is no m(2) such that b_n ∈ [−1/2, 1/2] for all n ≥ m(2). Thus the sequence does not converge.
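The caption's claim — every b_n lies in [−1, 1], yet no m(2) keeps b_n inside [−1/2, 1/2] from some point on — can be checked mechanically. A small sketch (the search range is an arbitrary assumption):

```python
def b(n):
    """Figure 1.17's sequence: b_n = 3/4 when n is a power of 5, else 0."""
    while n % 5 == 0:
        n //= 5
    return 0.75 if n == 1 else 0.0

# For any candidate m(2), some n >= m(2) still has |b_n| > 1/2: the
# outages at 5, 25, 125, ... become ever rarer but never stop.
violations = [n for n in range(1, 20_000) if abs(b(n)) > 0.5]
print(violations)  # → [1, 5, 25, 125, 625, 3125, 15625]
```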
The following example illustrates how a sequence of rv’s can converge in probability but
not converge WP1. The example also provides some clues as to why convergence WP1 is
important.
Example 1.5.1. Consider a sequence {Y_n; n ≥ 1} of rv's for which the sample paths constitute the following slight variation of the sequence of real numbers in Figure 1.17. In particular, as illustrated in Figure 1.18, the non-zero term at n = 5^j in Figure 1.17 is replaced by a non-zero term at a randomly chosen n in the interval⁴³ [5^j, 5^{j+1}).

43 There is no special significance to the number 5 here other than making the figure easy to visualize. We could replace 5 by 2 or 3, etc.
Figure 1.18: Illustration of a sample path of a sequence of rv's {Y_n; n ≥ 0} where, for each j ≥ 0, Y_n = 1 for an equiprobable choice of n ∈ [5^j, 5^{j+1}) and Y_n = 0 otherwise.
Since each sample path contains a single one in each segment [5^j, 5^{j+1}), and contains zeros elsewhere, none of the sample paths converge. In other words, Pr{ω : lim Y_n(ω) = 0} = 0. On the other hand, Pr{Y_n = 0} = 1 − 5^{−j} for 5^j ≤ n < 5^{j+1}, so lim_{n→∞} Pr{Y_n = 0} = 1.
Thus this sequence of rv’s converges to 0 in probability, but does not converge to 0 WP1.
This sequence also converges in mean square and (since it converges in probability) in
distribution. Thus we have shown (by example) that convergence WP1 is not implied by
any of the other types of convergence we have discussed. We will show in Section 5.2 that
convergence WP1 does imply convergence in probability and in distribution but not in mean
square (as illustrated in Figure 1.15).
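Example 1.5.1 also simulates cleanly, which makes the split between the two modes of convergence vivid. In the sketch below (the number of paths and the probed index are illustrative assumptions, not from the text), almost every sample path has Y_n = 0 at a fixed large n, yet every path still contains a 1 later on:

```python
import random

def sample_path(num_segments, rng):
    """Indices n with Y_n = 1: one uniform choice in each segment [5^j, 5^(j+1))."""
    return {rng.randrange(5 ** j, 5 ** (j + 1)) for j in range(num_segments)}

rng = random.Random(0)
paths = [sample_path(6, rng) for _ in range(2000)]

# Convergence in probability: Pr{Y_n = 0} is near 1 at a fixed large n ...
n = 10_000  # lies in the segment [5^5, 5^6)
frac_zero = sum(n not in p for p in paths) / len(paths)
print(frac_zero)

# ... but no convergence WP1: every sample path has a 1 somewhere in [5^5, 5^6).
print(all(any(5 ** 5 <= m < 5 ** 6 for m in p) for p in paths))
```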
The interesting point in this example is that this sequence of rv’s is not bizarre (although it
is somewhat specialized to make the analysis simple). Another important point is that this
deﬁnition of convergence has a long history of being accepted as the ‘useful,’ ‘natural,’ and
‘correct’ way to deﬁne convergence for a sequence of real numbers. Thus it is not surprising
that convergence WP1 will turn out to be similarly useful for sequences of rv’s.
There is a price to be paid in using the concept of convergence WP1. We must then look
at the entire sequence of rv’s and can no longer analyze ﬁnite n-tuples and then go to the
limit as n → ∞. This requires a significant additional layer of abstraction, which involves additional mathematical precision and initial loss of intuition. For this reason we put off further discussion of convergence WP1 and the SLLN until Chapter 5 where it is needed.
1.6 Relation of probability models to the real world
Whenever experienced and competent engineers or scientists construct a probability model
to represent aspects of some system that either exists or is being designed for some applica-
tion, they must acquire a deep knowledge of the system and its surrounding circumstances,
and concurrently consider various types of probability models used in probabilistic analyses
of the same or similar systems. Usually very simple probability models help in understanding
the real-world system, and knowledge about the real-world system helps in understanding
what aspects of the system are well-modeled by a given probability model. For a text such
as this, there is insufficient space to understand the real-world aspects of each system that
might be of interest. We must use the language of various canonical real-world systems for
motivation and insight when studying probability models for various classes of systems, but
such models must necessarily be chosen more for their tutorial than practical value.
There is a danger, then, that readers will come away with the impression that analysis is
more challenging and important than modeling. To the contrary, for work on real-world
systems, modeling is almost always more difficult, more challenging, and more important
than analysis. The objective here is to provide the necessary knowledge and insight about
probabilistic models so that the reader can later combine this with a deep understanding
of particular real application areas. This will result in a useful interactive use of models,
analysis, and experimentation.
In this section, our purpose is not to learn how to model real-world problems, since, as
said above, this requires deep and specialized knowledge of whatever application area is of
interest. Rather it is to understand the following conceptual problem that was posed in
Section 1.1. Suppose we have a probability model of some real-world experiment involving
randomness in the sense expressed there. When the real-world experiment being modeled is
performed, there is an outcome, which presumably is one of the outcomes of the probability
model, but there is no observable probability.
It appears to be intuitively natural, for experiments that can be carried out repeatedly
under essentially the same conditions, to associate the probability of a given event with
the relative frequency of that event over many repetitions. We now have the background
to understand this approach. We ﬁrst look at relative frequencies within the probability
model, and then within the real world.
1.6.1 Relative frequencies in a probability model
We have seen that for any probability model, an extended probability model exists for n IID idealized experiments of the original model. For any event A in the original model, the indicator function I_A is a random variable, and the relative frequency of A over n IID experiments is the sample average of n IID rv's each with the distribution of I_A. From the weak law of large numbers, this relative frequency converges in probability to E[I_A] = Pr{A}. By taking the limit n → ∞, the strong law of large numbers says that the relative frequency of A converges with probability 1 to Pr{A}.
In plain English, this says that for large n, the relative frequency of an event (in the n-repetition IID model) is essentially the same as the probability of that event. The word essentially is carrying a great deal of hidden baggage. For the weak law, for any ε, c > 0, the relative frequency is within some ε of Pr{A} with a confidence level 1 − c whenever n is sufficiently large. For the strong law, the ε and c are avoided, but only by looking directly at the limit n → ∞. Despite the hidden baggage, though, relative frequency and probability are related as indicated.
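The relative-frequency view of Pr{A} is easy to exercise in code. The sketch below (the die experiment, the event, and the sample sizes are illustrative assumptions, not from the text) computes the sample average of the indicator I_A over n IID trials:

```python
import random

def relative_frequency(event, experiment, n, seed=0):
    """Sample average of the indicator I_A over n IID trials of the experiment."""
    rng = random.Random(seed)
    return sum(event(experiment(rng)) for _ in range(n)) / n

def roll_die(rng):
    return rng.randint(1, 6)  # idealized experiment: one roll of a fair die

def is_even(outcome):
    return outcome % 2 == 0   # event A with Pr{A} = 1/2

# As n grows, the relative frequency settles near Pr{A} = 0.5, as the
# weak and strong laws predict for the n-repetition IID model.
for n in (100, 10_000, 1_000_000):
    print(n, relative_frequency(is_even, roll_die, n))
```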
1.6.2 Relative frequencies in the real world
In trying to sort out if and when the laws of large numbers have much to do with real-world
experiments, we should ignore the mathematical details for the moment and agree that for
large n, the relative frequency of an event A over n IID trials of an idealized experiment
is essentially Pr{A}. We can certainly visualize a real-world experiment that has the same
set of possible outcomes as the idealized experiment and we can visualize evaluating the
relative frequency of A over n repetitions with large n. If that real-world relative frequency
is essentially equal to Pr{A}, and this is true for the various events A of greatest interest,
then it is reasonable to hypothesize that the idealized experiment is a reasonable model for
the real-world experiment, at least so far as those given events of interest are concerned.
One problem with this comparison of relative frequencies is that we have carefully speciﬁed
a model for n IID repetitions of the idealized experiment, but have said nothing about how
the real-world experiments are repeated. The IID idealized experiments specify that the
conditional probability of A at one trial is the same no matter what the results of the other
trials are. Intuitively, we would then try to isolate the n real-world trials so they don’t affect
each other, but this is a little vague. The following examples help explain this problem and
several others in comparing idealized and real-world relative frequencies.
Example 1.6.1. Coin tossing: Tossing coins is widely used as a way to choose the ﬁrst
player in various games, and is also sometimes used as a primitive form of gambling. Its
importance, however, and the reason for its frequent use, is its simplicity. When tossing
a coin, we would argue from the symmetry between the two sides of the coin that each
should be equally probable (since any procedure for evaluating the probability of one side
should apply equally to the other). Thus since H and T are the only outcomes (the remote
possibility of the coin balancing on its edge is omitted from the model), the reasonable and
universally accepted model for coin tossing is that H and T each have probability 1/2.
On the other hand, the two sides of a coin are embossed in different ways, so that the mass is not uniformly distributed. Also the two sides do not behave in quite the same way when bouncing off a surface. Each denomination of each currency behaves slightly differently in this respect. Thus, not only do coins violate symmetry in small ways, but different coins violate it in different ways.
How do we test whether this effect is significant? If we assume for the moment that successive tosses of the coin are well-modeled by the idealized experiment of n IID trials, we can essentially find the probability of H for a particular coin as the relative frequency of H in a sufficiently large number of independent tosses of that coin. This gives us slightly different relative frequencies for different coins, and thus slightly different probability models for different coins.
The assumption of independent tosses is also questionable. Consider building a carefully engineered machine for tossing coins and using it in a vibration-free environment. A standard
coin is inserted into the machine in the same way for each toss and we count the number
of heads and tails. Since the machine has essentially eliminated the randomness, we would
expect all the coins, or almost all the coins, to come up the same way — the more precise
the machine, the less independent the results. By inserting the original coin in a random
way, a single trial might have equiprobable results, but successive tosses are certainly not
independent. The successive trials would be closer to independent if the tosses were done
by a slightly inebriated individual who tossed the coins high in the air.
The point of this example is that there are many di↵erent coins and many ways of tossing
them, and the idea that one model ﬁts all is reasonable under some conditions and not
under others. Rather than retreating into the comfortable world of theory, however, note
that we can now find the relative frequency of heads for any given coin and essentially for any given way of tossing that coin.⁴⁴
Example 1.6.2. Binary data: Consider the binary data transmitted over a communication link or stored in a data facility. The data is often a mixture of encoded voice, video, graphics, text, etc., with relatively long runs of each, interspersed with various protocols for retrieving the original non-binary data.
The simplest (and most common) model for this is to assume that each binary digit is 0 or
1 with equal probability and that successive digits are statistically independent. This is the
same as the model for coin tossing after the trivial modiﬁcation of converting {H, T} into
{0, 1}. This is also a rather appropriate model for designing a communication or storage
facility, since all n-tuples are then equiprobable (in the model) for each n, and thus the
facilities need not rely on any special characteristics of the data. On the other hand, if one
wants to compress the data, reducing the required number of transmitted or stored bits per
incoming bit, then a more elaborate model is needed.
Developing such an improved model would require ﬁnding out more about where the data
is coming from — a naive application of calculating relative frequencies of n-tuples would
probably not be the best choice. On the other hand, there are well-known data compression
schemes that in essence track dependencies in the data and use them for compression in a
coordinated way. These schemes are called universal data-compression schemes since they
don’t rely on a probability model. At the same time, they are best analyzed by looking at
how they perform for various idealized probability models.
The point of this example is that choosing probability models often depends heavily on how
the model is to be used. Models more complex than IID binary digits are usually based on
what is known about the input processes. Measuring relative frequencies and associating
them with probabilities is the basic underlying conceptual connection between real-world
and models, but in practice this is essentially the relationship of last resort. For most of
the applications we will consider, there is a long history of modeling to build on, with
experiments as needed.
Example 1.6.3. Fable: In the year 2008, the ﬁnancial structure of the USA failed and
the world economy was brought to its knees. Much has been written about the role of greed
on Wall Street and incompetence in Washington. Another aspect of the collapse, however,
was a widespread faith in stochastic models for limiting risk. These models encouraged
people to engage in investments that turned out to be far riskier than the models predicted.
These models were created by some of the brightest PhD’s from the best universities, but
they failed miserably because they modeled everyday events very well, but modeled the rare
events and the interconnection of events poorly. They failed badly by not understanding
their application, and in particular, by trying to extrapolate typical behavior when their
primary goal was to protect against highly atypical situations. The moral of the fable is
44 We are not suggesting that distinguishing different coins for the sake of coin tossing is an important problem. Rather, we are illustrating that even in such a simple situation, both the assumption of identically prepared experiments and the assumption of independent experiments are questionable. The extension to n repetitions of IID experiments is not necessarily a good model for coin tossing. In other words, one has to question not only the original model but also the n-repetition model.
that brilliant analysis is not helpful when the modeling is poor; as computer engineers say,
“garbage in, garbage out.”
The examples above show that the problems of modeling a real-world experiment are often
connected with the question of creating a model for a set of experiments that are not
exactly the same and do not necessarily correspond to the notion of independent repetitions
within the model. In other words, the question is not only whether the probability model is
reasonable for a single experiment, but also whether the IID repetition model is appropriate
for multiple copies of the real-world experiment.
At least we have seen, however, that if a real-world experiment can be performed many times
with a physical isolation between performances that is well modeled by the IID repetition
model, then the relative frequencies of events in the real-world experiment correspond to
relative frequencies in the idealized IID repetition model, which correspond to probabilities
in the original model. In other words, under appropriate circumstances, the probabilities
in a model become essentially observable over many repetitions.
We will see later that our emphasis on IID repetitions was done for simplicity. There are
other models for repetitions of a basic model, such as Markov models, that we study later.
These will also lead to relative frequencies approaching probabilities within the repetition
model. Thus, for repeated real-world experiments that are well modeled by these repetition
models, the real-world relative frequencies approximate the probabilities in the model.
1.6.3 Statistical independence of real-world experiments
We have been discussing the use of relative frequencies of an event A in a repeated real-world experiment to test Pr{A} in a probability model of that experiment. This can be done essentially successfully if the repeated trials correspond to IID trials in the idealized experiment. However, the statement about IID trials in the idealized experiment is a statement about probabilities in the extended n-trial model. Thus, just as we tested Pr{A} by repeated real-world trials of a single experiment, we should be able to test Pr{A_1, . . . , A_n} in the n-repetition model by a much larger number of real-world repetitions of n-tuples rather than single trials.
To be more specific, choose two large integers, m and n, and perform the underlying real-world experiment mn times. Partition the mn trials into m runs of n trials each. For any given n-tuple A_1, . . . , A_n of successive events, find the relative frequency (over m trials of n-tuples) of the n-tuple event A_1, . . . , A_n. This can then be used essentially to test the probability Pr{A_1, . . . , A_n} in the model for n IID trials. The individual event probabilities can also be tested, so the condition for independence can be tested.
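The m-runs-of-n procedure just described can be sketched in a few lines (the fair-coin source, the values of m and n, and the target tuple are illustrative assumptions, not from the text):

```python
import random

def tuple_relative_frequency(outcomes, target, n):
    """Partition the mn outcomes into m runs of n and return the relative
    frequency (over the m runs) of the n-tuple event 'target'."""
    m = len(outcomes) // n
    runs = [tuple(outcomes[i * n:(i + 1) * n]) for i in range(m)]
    return sum(run == target for run in runs) / m

rng = random.Random(0)
m, n = 50_000, 3
flips = [rng.randrange(2) for _ in range(m * n)]  # mn trials of the experiment
freq = tuple_relative_frequency(flips, (1, 1, 1), n)
# Under the model of n IID fair tosses, Pr{(1, 1, 1)} = 1/8 = 0.125.
print(freq)
```

As the text notes, agreement of such frequencies with the IID model supports (but can never completely verify) the independence hypothesis.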
The observant reader will note that there is a tacit assumption above that successive n-tuples can be modeled as independent, so it seems that we are simply replacing a big problem with a bigger one. This is not quite true, since if the trials are dependent with some given probability model for dependent trials, then this test for independence will essentially reject the independence hypothesis for large enough n. In other words, we cannot completely verify the correctness of an independence hypothesis for the n-trial model, although in principle we could eventually falsify it if it is false.
Choosing models for real-world experiments is primarily a subject for statistics, and we will
not pursue it further except for brief discussions when treating particular application areas.
The purpose here has been to treat a fundamental issue in probability theory. As stated
before, probabilities are non-observables — they exist in the theory but are not directly
measurable in real-world experiments. We have shown that probabilities essentially become
observable in the real-world via relative frequencies over repeated trials.
1.6.4 Limitations of relative frequencies
Most real-world applications that are modeled by probability models have such a large sample space that it is impractical to conduct enough trials to choose probabilities from relative frequencies. Even a shuffled deck of 52 cards would require many more than 52! ≈ 8 × 10^67 trials for most of the outcomes to appear even once. Thus relative frequencies can be used to test the probability of given individual events of importance, but are usually impractical for choosing the entire model and even more impractical for choosing a model for repeated trials.
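The 52! figure quoted above is a one-line computation:

```python
import math

arrangements = math.factorial(52)  # number of orderings of a 52-card deck
print(f"{arrangements:.2e}")       # → 8.07e+67
```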
Since relative frequencies give us a concrete interpretation of what probability means, however, we can now rely on other approaches, such as symmetry, for modeling. From symmetry, for example, it is clear that all 52! possible arrangements of a card deck should be equiprobable after shuffling. This leads, for example, to the ability to calculate probabilities of different poker hands, etc., which are such popular exercises in elementary probability classes.
Another valuable modeling procedure is that of constructing a probability model where the
possible outcomes are independently chosen n-tuples of outcomes in a simpler model. More
generally, most of the random processes to be studied in this text are deﬁned as various
ways of combining simpler idealized experiments.
What is really happening as we look at modeling increasingly sophisticated systems and
studying increasingly sophisticated models is that we are developing mathematical results
for simple idealized models and relating those results to real-world results (such as relating
idealized statistically independent trials to real-world independent trials). The association
of relative frequencies to probabilities forms the basis for this, but is usually exercised only
in the simplest cases.
The way one selects probability models of real-world experiments in practice is to use scientific knowledge and experience, plus simple experiments, to choose a reasonable model. The results from the model (such as the law of large numbers) are then used both to hypothesize results about the real-world experiment and to provisionally reject the model when further experiments show it to be highly questionable. Although the results about the model are mathematically precise, the corresponding results about the real world are at best insightful hypotheses whose most important aspects must be validated in practice.
1.6.5 Subjective probability
There are many useful applications of probability theory to situations other than repeated
trials of a given experiment. When designing a new system in which randomness (of the type
used in probability models) is hypothesized, one would like to analyze the system before
actually building it. In such cases, the real-world system does not exist, so indirect means
must be used to construct a probability model. Often some sources of randomness, such as
noise, can be modeled in the absence of the system. Often similar systems or simulation
can be used to help understand the system and help in formulating appropriate probability
models. However, the choice of probabilities is to a certain extent subjective.
Another type of situation (such as risk analysis for nuclear reactors) deals with a large
number of very unlikely outcomes, each catastrophic in nature. Experimentation clearly
cannot be used to establish probabilities, and it is not clear that probabilities have any
real meaning here. It can be helpful, however, to choose a probability model on the basis
of subjective beliefs which can be used as a basis for reasoning about the problem. When
handled well, this can at least make the subjective biases clear, leading to a more rational
approach. When handled poorly (as for example in some risk analyses of large ﬁnancial
systems) it can hide both the real risks and the arbitrary nature of possibly poor decisions.
We will not discuss the various, often ingenious, methods for choosing subjective proba-
bilities. The reason is that subjective beliefs should be based on intensive and long term
exposure to the particular problem involved; discussing these problems in abstract proba-
bility terms weakens this link. We will focus instead on the analysis of idealized models.
These can be used to provide insights for subjective models, and more reﬁned and precise
results for objective models.
1.7 Summary
This chapter started with an introduction to the correspondence between probability theory and real-world experiments involving randomness. While almost all work in probability theory is done with established probability models, it is important to think through what these probabilities mean in the real world; elementary subjects rarely address these questions seriously.
The next section discussed the axioms of probability theory, along with some insights about
why these particular axioms were chosen. This was followed by a review of conditional
probabilities, statistical independence, random variables, stochastic processes, and expec-
tations. The emphasis was on understanding the underlying structure of the ﬁeld rather
than reviewing details and problem solving techniques.
The next topic was the development of the laws of large numbers at a somewhat deeper level than in most elementary courses. This involved a fair amount of abstraction combined with mathematical analysis. The central idea is that the sample average of n IID rv's approaches the mean with increasing n. As a special case, the relative frequency of an event A approaches Pr{A}. What the word "approaches" means here is both tricky and vital in understanding probability theory. The strong law of large numbers and convergence WP1 require mathematical maturity, and are postponed to Chapter 5 where they are first used.
The final section came back to the fundamental problem of understanding the relation between probability theory and randomness in the real world. It was shown, via the laws of large numbers, that probabilities become essentially observable via relative frequencies calculated over repeated experiments.
There are too many texts on elementary probability to mention here, and most of them
serve to give added understanding and background to the material in this chapter. We
recommend Bertsekas and Tsitsiklis [2], both for a careful statement of the fundamentals
and for a wealth of well-chosen and carefully explained examples.
Texts that cover similar material to that here are [18] and [12]. Kolmogorov [15] is readable
for the mathematically mature and is also of historical interest as the translation of the 1933
book that ﬁrst put probability on a ﬁrm mathematical basis. Feller [7] is the classic extended
and elegant treatment of elementary material from a mature point of view. Rudin [19] is
an excellent text on measure theory for those with advanced mathematical preparation.
1.8 Exercises
Exercise 1.1. Consider a sequence A_1, A_2, ... of events, each of which has probability zero.

a) Find Pr{∪_{n=1}^m A_n} and find lim_{m→∞} Pr{∪_{n=1}^m A_n}. What you have done is to show that the sum of a countably infinite set of numbers, each equal to 0, is perfectly well defined as 0.

b) For a sequence of possible phases, a_1, a_2, ... between 0 and 2π, and a sequence of singleton events, A_n = {a_n}, find Pr{∪_n A_n} assuming that the phase is uniformly distributed.

c) Now let each A_n be the empty event ∅. Use (1.1) and part a) to show that Pr{∅} = 0.
Exercise 1.2. Let A_1 and A_2 be arbitrary events and show that Pr{A_1 ∪ A_2} + Pr{A_1A_2} = Pr{A_1} + Pr{A_2}. Explain which parts of the sample space are being double counted on both sides of this equation and which parts are being counted once.
Exercise 1.3. This exercise derives the probability of an arbitrary (non-disjoint) union of events, derives the union bound, and derives some useful limit expressions.

a) For 2 arbitrary events A_1 and A_2, show that

    A_1 ∪ A_2 = A_1 ∪ (A_2 − A_1),    where A_2 − A_1 = A_2 A_1^c.

Show that A_1 and A_2 − A_1 are disjoint. Hint: This is what Venn diagrams were invented for.

b) For an arbitrary sequence of events, {A_n; n ≥ 1}, let B_1 = A_1 and for each n ≥ 2 define B_n = A_n − ∪_{m=1}^{n−1} A_m. Show that B_1, B_2, ... are disjoint events and show that for each n ≥ 2, ∪_{m=1}^n A_m = ∪_{m=1}^n B_m.

c) Show that

    Pr{∪_{n=1}^∞ A_n} = Pr{∪_{n=1}^∞ B_n} = Σ_{n=1}^∞ Pr{B_n}.

Hint: Use the axioms of probability for the second equality.

d) Show that for each n, Pr{B_n} ≤ Pr{A_n}. Use this to show that

    Pr{∪_{n=1}^∞ A_n} ≤ Σ_{n=1}^∞ Pr{A_n}.

e) Show that Pr{∪_{n=1}^∞ A_n} = lim_{m→∞} Pr{∪_{n=1}^m A_n}. Hint: Combine parts c) and b). Note that this says that the probability of a limit of unions is equal to the limit of the probabilities. This might well appear to be obvious without a proof, but you will see situations later where similar appearing interchanges cannot be made.

f) Show that Pr{∩_{n=1}^∞ A_n} = lim_{n→∞} Pr{∩_{i=1}^n A_i}. Hint: Remember De Morgan's equalities.
Exercise 1.4. Find the probability that a five-card poker hand, chosen randomly from a 52-card deck, contains 4 aces. That is, if all 52! arrangements of a deck of cards are equally likely, what is the probability that all 4 aces are in the first 5 cards of the deck?
Exercise 1.5. Consider a sample space of 8 equiprobable sample points and let A_1, A_2, A_3 be three events each of probability 1/2 such that Pr{A_1A_2A_3} = Pr{A_1} Pr{A_2} Pr{A_3}.

a) Create an example where Pr{A_1A_2} = Pr{A_1A_3} = 1/4 but Pr{A_2A_3} = 1/8. Hint: Make a table with a row for each sample point and a column for each event and try different ways of assigning sample points to events (the answer is not unique).

b) Show that, for your example, A_2 and A_3 are not independent. Note that the definition of statistical independence would be very strange if it allowed A_1, A_2, A_3 to be independent while A_2 and A_3 are dependent. This illustrates why the definition of independence requires (1.14) rather than just (1.15).
Exercise 1.6. This exercise shows that for all rv's X, F_X(x) is continuous from the right.

a) For any given rv X, any real number x, and each integer n ≥ 1, let A_n = {ω : X > x + 1/n}, and show that A_1 ⊆ A_2 ⊆ ···. Use this and the corollaries to the axioms of probability to show that Pr{∪_{n≥1} A_n} = lim_{n→∞} Pr{A_n}.

b) Show that Pr{∪_{n≥1} A_n} = Pr{X > x} and that Pr{X > x} = lim_{n→∞} Pr{X > x + 1/n}.

c) Show that for ε > 0, lim_{ε→0} Pr{X ≤ x + ε} = Pr{X ≤ x}.

d) Define F̄_X(x) = Pr{X < x}. Show that F̄_X(x) is continuous from the left. In other words, the continuity from the right for the distribution function arises from the almost arbitrary (but universally accepted) choice in defining the distribution function as Pr{X ≤ x} rather than Pr{X < x}.
Exercise 1.7. Show that for a continuous non-negative rv X,

    ∫_0^∞ Pr{X > x} dx = ∫_0^∞ x f_X(x) dx.    (1.101)

Hint 1: First rewrite Pr{X > x} on the left side of (1.101) as ∫_x^∞ f_X(y) dy. Then think through, to your level of comfort, how and why the order of integration can be interchanged in the resulting expression.

Hint 2: As an alternate approach, derive (1.101) using integration by parts.
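For a quick numerical sanity check of (1.101), one can discretize both integrals for a specific density; the sketch below (our own choice of the unit-rate exponential, for which both sides should equal E[X] = 1) uses simple left Riemann sums:

```python
from math import exp

# Riemann-sum check of (1.101) for f_X(x) = exp(-x), x >= 0, where
# Pr{X > x} = exp(-x) and both sides should equal E[X] = 1.
dx = 1e-3
xs = [i * dx for i in range(int(40 / dx))]   # [0, 40) covers nearly all mass
lhs = sum(exp(-x) * dx for x in xs)          # integral of Pr{X > x}
rhs = sum(x * exp(-x) * dx for x in xs)      # integral of x f_X(x)
print(lhs, rhs)
```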
Exercise 1.8. Suppose X and Y are discrete rv's with the PMF p_XY(x_i, y_j). Show (a picture will help) that this is related to the joint distribution function by

    p_XY(x_i, y_j) = lim_{δ>0, δ→0} [F(x_i, y_j) − F(x_i − δ, y_j) − F(x_i, y_j − δ) + F(x_i − δ, y_j − δ)].
Exercise 1.9. A variation of Example 1.3.1 is to let M be a random variable that takes on both positive and negative values with the PMF

    p_M(m) = 1 / (2|m| (|m| + 1)).

In other words, M is symmetric around 0 and |M| has the same PMF as the nonnegative rv N of Example 1.3.1.

a) Show that Σ_{m≥0} m p_M(m) = ∞ and Σ_{m<0} m p_M(m) = −∞. (Thus show that the expectation of M not only does not exist but is undefined even in the extended real number system.)

b) Suppose that the terms in Σ_{m=−∞}^∞ m p_M(m) are summed in the order of 2 positive terms for each negative term (i.e., in the order 1, 2, −1, 3, 4, −2, 5, ···). Find the limiting value of the partial sums in this series. Hint: You may find it helpful to know that

    lim_{n→∞} [ Σ_{i=1}^n (1/i) − ∫_1^n (1/x) dx ] = γ,

where γ is the Euler–Mascheroni constant, γ = 0.57721 ··· .

c) Repeat part b) where, for any given integer k > 0, the order of summation is k positive terms for each negative term.
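The hint's limit can be checked numerically; the following sketch (our own illustration, with function names of our choosing) shows the partial harmonic sums minus ln n settling toward γ ≈ 0.57722:

```python
from math import log

# The bracketed quantity in the hint is (sum of 1/i for i <= n) - ln n,
# which decreases toward the Euler-Mascheroni constant gamma = 0.57721...
def harmonic_minus_log(n: int) -> float:
    return sum(1.0 / i for i in range(1, n + 1)) - log(n)

for n in (10, 1_000, 100_000):
    print(n, harmonic_minus_log(n))
```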
Exercise 1.10. Let X be a ternary rv taking on the 3 values 0, 1, 2 with probabilities p_0, p_1, p_2 respectively. Find the median of X for each of the cases below.

a) p_0 = 0.2, p_1 = 0.4, p_2 = 0.4.

b) p_0 = 0.2, p_1 = 0.2, p_2 = 0.6.

c) p_0 = 0.2, p_1 = 0.3, p_2 = 0.5.

Note 1: The median is not unique in part c). Find the interval of values that are medians. Note 2: Some people force the median to be distinct by defining it as the midpoint of the interval satisfying the definition given here.

d) Now suppose that X is non-negative and continuous with the density f_X(x) = 1 for 0 ≤ x ≤ 0.5 and f_X(x) = 0 for 0.5 < x ≤ 1. We know that f_X(x) is positive for all x > 1, but it is otherwise unknown. Find the median or interval of medians.

The median is sometimes (incorrectly) defined as that α for which Pr{X > α} = Pr{X < α}. Show that it is possible for no such α to exist. Hint: Look at the examples above.
Exercise 1.11. a) For any given rv Y, express E[|Y|] in terms of ∫_{y<0} F_Y(y) dy and ∫_{y≥0} F_Y^c(y) dy. Hint: Review the argument in Figure 1.4.

b) For some given rv X with E[|X|] < ∞, let Y = X − α. Using part a), show that

    E[|X − α|] = ∫_{−∞}^{α} F_X(x) dx + ∫_{α}^{∞} F_X^c(x) dx.

c) Show that E[|X − α|] is minimized over α by choosing α to be a median of X. Hint: Both the easy way and the most instructive way to do this is to use a graphical argument involving shifting Figure 1.4. Be careful to show that when the median is an interval, all points in this interval achieve the minimum.
Exercise 1.12. Let X be a rv with distribution function F_X(x). Find the distribution function of the following rv's.

a) The maximum of n IID rv's, each with distribution function F_X(x).

b) The minimum of n IID rv's, each with distribution F_X(x).

c) The difference of the rv's defined in a) and b); assume X has a density f_X(x).
Exercise 1.13. Let X and Y be rv's in some sample space Ω and let Z = X + Y, i.e., for each ω ∈ Ω, Z(ω) = X(ω) + Y(ω).

a) Show that the set of ω for which Z(ω) = ±∞ has probability 0.

b) To show that Z = X + Y is a rv, we must show that for each real number α, the set {ω ∈ Ω : X(ω) + Y(ω) ≤ α} is an event. We proceed indirectly. For an arbitrary positive integer n and an arbitrary integer k, let B(n, k) = {ω : X(ω) ≤ kα/n} ∩ {ω : Y(ω) ≤ (n + 1 − k)α/n}. Let D(n) = ∪_k B(n, k) and show that D(n) is an event.

c) On a 2-dimensional sketch for a given α, show the values of X(ω) and Y(ω) for which ω ∈ D(n). Hint: This set of values should be bounded by a staircase function.

d) Show that

    {ω : X(ω) + Y(ω) ≤ α} = ∩_n D(n).

Explain why this shows that Z = X + Y is a rv.

e) Explain why this implies that if Y = X_1 + X_2 + ··· + X_n and if X_1, X_2, ..., X_n are rv's, then Y is a rv. Hint: Only one or two lines of explanation are needed.
Exercise 1.14. a) Let X_1, X_2, ..., X_n be rv's with expected values X̄_1, ..., X̄_n. Show that E[X_1 + ··· + X_n] = X̄_1 + ··· + X̄_n. You may assume that the rv's have a joint density function, but do not assume that the rv's are independent.

b) Now assume that X_1, ..., X_n are statistically independent and show that the expected value of the product is equal to the product of the expected values.

c) Again assuming that X_1, ..., X_n are statistically independent, show that the variance of the sum is equal to the sum of the variances.
Exercise 1.15. (Stieltjes integration) a) Let h(x) = u(x) and F_X(x) = u(x) where u(x) is the unit step, i.e., u(x) = 0 for −∞ < x < 0 and u(x) = 1 for x ≥ 0. Using the definition of the Stieltjes integral in Footnote 24, show that ∫_{−1}^{1} h(x) dF_X(x) does not exist. Hint: Look at the term in the Riemann sum including x = 0 and look at the range of choices for h(x) in that interval. Intuitively, it might help initially to view dF_X(x) as a unit impulse at x = 0.

b) Let h(x) = u(x − a) and F_X(x) = u(x − b) where a and b are in (−1, +1). Show that ∫_{−1}^{1} h(x) dF_X(x) exists if and only if a ≠ b. Show that the integral has the value 1 for a < b and the value 0 for a > b. Argue that this result is still valid in the limit of integration over (−∞, ∞).

c) Let X and Y be independent discrete rv's, each with a finite set of possible values. Show that ∫_{−∞}^{∞} F_X(z − y) dF_Y(y), defined as a Stieltjes integral, is equal to the distribution of Z = X + Y at each z other than the possible sample values of Z, and is undefined at each sample value of Z. Hint: Express F_X and F_Y as sums of unit steps. Note: This failure of Stieltjes integration is not a serious problem; F_Z(z) is a step function, and the integral is undefined at its points of discontinuity. We automatically define F_Z(z) at those step values so that F_Z is a distribution function (i.e., is continuous from the right). This problem does not arise if either X or Y is continuous.
Exercise 1.16. Let X_1, X_2, ..., X_n, ... be a sequence of IID continuous rv's with the common probability density function f_X(x); note that Pr{X = α} = 0 for all α and that Pr{X_i = X_j} = 0 for all i ≠ j. For n ≥ 2, define X_n as a record-to-date of the sequence if X_n > X_i for all i < n.

a) Find the probability that X_2 is a record-to-date. Use symmetry to obtain a numerical answer without computation. A one or two line explanation should be adequate.

b) Find the probability that X_n is a record-to-date, as a function of n ≥ 1. Again use symmetry.

c) Find a simple expression for the expected number of records-to-date that occur over the first m trials for any given integer m. Hint: Use indicator functions. Show that this expected number is infinite in the limit m → ∞.
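A small simulation (our own sketch, not part of the exercise; the function name is ours) can be used to sanity-check an answer to part b); by the symmetry argument, the printed estimates should match whatever expression you derive:

```python
import random

# Estimate Pr{X_n is a record-to-date} for IID continuous samples.
def record_prob(n: int, trials: int, seed: int = 0) -> float:
    rng = random.Random(seed)
    records = 0
    for _ in range(trials):
        xs = [rng.random() for _ in range(n)]
        if xs[-1] > max(xs[:-1]):       # X_n beats X_1, ..., X_{n-1}
            records += 1
    return records / trials

for n in (2, 5, 10):
    print(n, record_prob(n, 50_000))
```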
Exercise 1.17. (Continuation of Exercise 1.16)

a) Let N_1 be the index of the first record-to-date in the sequence. Find Pr{N_1 > n} for each n ≥ 2. Hint: There is a far simpler way to do this than working from part b) of Exercise 1.16.

b) Show that N_1 is a rv.

c) Show that E[N_1] = ∞.

d) Let N_2 be the index of the second record-to-date in the sequence. Show that N_2 is a rv. Hint: You need not find the distribution function of N_2 here.

e) Contrast your result in part c) to the result from part c) of Exercise 1.16 saying that the expected number of records-to-date is infinite over an infinite number of trials. Note: this should be a shock to your intuition; there is an infinite expected wait for the first of an infinite sequence of occurrences, each of which must eventually occur.
Exercise 1.18. (Another direction from Exercise 1.16) a) For any given n ≥ 2, find the probability that X_n and X_{n+1} are both records-to-date. Hint: The idea in part b) of Exercise 1.16 is helpful here, but the result is not.

b) Is the event that X_n is a record-to-date statistically independent of the event that X_{n+1} is a record-to-date?

c) Find the expected number of adjacent pairs of records-to-date over the sequence X_1, X_2, ... . Hint: A helpful fact here is that 1/(n(n+1)) = 1/n − 1/(n+1).
Exercise 1.19. a) Assume that X is a nonnegative discrete rv taking on values a_1, a_2, ..., and let Y = h(X) for some nonnegative function h. Let b_i = h(a_i), i ≥ 1, be the ith value taken on by Y. Show that E[Y] = Σ_{i≥1} b_i p_Y(b_i) = Σ_{i≥1} h(a_i) p_X(a_i).

b) Let X be a nonnegative rv with distribution function F(x), and let h(x) ≥ 0 be a nondecreasing function of x. For c > 0, let

    A(c) = Σ_{n≥1} h(nc)[F(nc) − F(nc − c)],

i.e., A(c) is a cth order approximation to the Stieltjes integral ∫ h(x) dF(x). Show that if A(1) < ∞, then A(2^{−k}) ≤ A(2^{−(k−1)}) < ∞. Show from this that ∫ h(x) dF(x) converges to a finite value. Note: this is a very special case, but it can be extended to many cases of interest. It seems better to consider these convergence questions as required rather than consider them in general.
Exercise 1.20. a) Consider a positive, integer-valued rv whose distribution function is given at integer values by

    F_Y(y) = 1 − 2/((y + 1)(y + 2))    for integer y ≥ 0.

Use (1.33) to show that E[Y] = 2. Hint: Note the PMF given in (1.32).

b) Find the PMF of Y and use it to check the value of E[Y].

c) Let X be another positive, integer-valued rv. Assume its conditional PMF is given by

    p_{X|Y}(x|y) = 1/y    for 1 ≤ x ≤ y.

Find E[X | Y = y] and show that E[X] = 3/2. Explore finding p_X(x) until you are convinced that using the conditional expectation to calculate E[X] is considerably easier than using p_X(x).

d) Let Z be another integer-valued rv with the conditional PMF

    p_{Z|Y}(z|y) = 1/y²    for 1 ≤ z ≤ y².

Find E[Z | Y = y] for each integer y ≥ 1 and find E[Z].
Exercise 1.21. a) Show that, for uncorrelated rv's, the expected value of the product is equal to the product of the expected values (by definition, X and Y are uncorrelated if E[(X − X̄)(Y − Ȳ)] = 0).

b) Show that if X and Y are uncorrelated, then the variance of X + Y is equal to the variance of X plus the variance of Y.

c) Show that if X_1, ..., X_n are uncorrelated, then the variance of the sum is equal to the sum of the variances.

d) Show that independent rv's are uncorrelated.

e) Let X, Y be identically distributed ternary valued random variables with the PMF p_X(−1) = p_X(1) = 1/4; p_X(0) = 1/2. Find a simple joint probability assignment such that X and Y are uncorrelated but dependent.

f) You have seen that the moment generating function of a sum of independent rv's is equal to the product of the individual moment generating functions. Give an example where this is false if the variables are uncorrelated but dependent.
Exercise 1.22. Suppose X has the Poisson PMF, p_X(n) = λ^n exp(−λ)/n! for n ≥ 0, and Y has the Poisson PMF, p_Y(m) = µ^m exp(−µ)/m! for m ≥ 0. Assume that X and Y are independent. Find the distribution of Z = X + Y and find the conditional distribution of Y conditional on Z = n.
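One way to check a candidate answer for the distribution of Z numerically is to convolve the two PMFs directly; the sketch below (parameter values λ = 2 and µ = 3 are our own choice) compares the convolution against the Poisson PMF of rate λ + µ:

```python
from math import exp, factorial

def poisson_pmf(rate: float, n: int) -> float:
    return rate ** n * exp(-rate) / factorial(n)

lam, mu = 2.0, 3.0
for n in range(6):
    # PMF of Z = X + Y at n: sum over the ways X = k, Y = n - k.
    conv = sum(poisson_pmf(lam, k) * poisson_pmf(mu, n - k)
               for k in range(n + 1))
    print(n, conv, poisson_pmf(lam + mu, n))
```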
Exercise 1.23. a) Suppose X, Y and Z are binary rv's, each taking on the value 0 with probability 1/2 and the value 1 with probability 1/2. Find a simple example in which X, Y, Z are statistically dependent but are pairwise statistically independent (i.e., X, Y are statistically independent, X, Z are statistically independent, and Y, Z are statistically independent). Give p_{XYZ}(x, y, z) for your example. Hint: In the simplest example, there are four joint values for x, y, z that have probability 1/4 each.

b) Is pairwise statistical independence enough to ensure that

    E[∏_{i=1}^n X_i] = ∏_{i=1}^n E[X_i]

for a set of rv's X_1, ..., X_n?
Exercise 1.24. Show that E[X] is the value of α that minimizes E[(X − α)²].
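A quick numerical illustration of the claim (the toy PMF below is our own choice): E[(X − α)²], viewed as a function of α, is a parabola whose minimum sits at α = E[X].

```python
# Toy PMF: X takes values 0, 1, 3 with probabilities .2, .5, .3.
xs = [0, 1, 3]
ps = [0.2, 0.5, 0.3]
mean = sum(x * p for x, p in zip(xs, ps))            # E[X] = 1.4

def mse(a: float) -> float:
    """E[(X - a)^2] as a function of a."""
    return sum(p * (x - a) ** 2 for x, p in zip(xs, ps))

# The mean beats any nearby candidate.
assert mse(mean) < mse(mean - 0.1) and mse(mean) < mse(mean + 0.1)
print(mean, mse(mean))
```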
Exercise 1.25. For each of the following random variables, find the interval (r_−, r_+) over which the moment generating function g(r) exists. Determine in each case whether g_X(r) exists at the endpoints r_− and r_+.

a) For a rv X and a constant c, explain why g''_{(X−c)}(r) ≥ 0.

b) Show that g''_{(X−c)}(r) = [g''_X(r) − 2c g'_X(r) + c² g_X(r)] e^{−rc}.

c) Use a) and b) to show that g''_X(r) g_X(r) − [g'_X(r)]² ≥ 0. Let γ_X(r) = ln g_X(r) and show that γ''_X(r) ≥ 0. Hint: Choose c = g'_X(r)/g_X(r).

d) Assume that X is non-deterministic, i.e., that there is no value of α such that Pr{X = α} = 1. Show that the inequality sign "≥" may be replaced by ">" everywhere in a), b) and c).
Exercise 1.28. A computer system has n users, each with a unique name and password. Due to a software error, the n passwords are randomly permuted internally (i.e., each of the n! possible permutations is equally likely). Only those users lucky enough to have had their passwords unchanged in the permutation are able to continue using the system.

a) What is the probability that a particular user, say user 1, is able to continue using the system?

b) What is the expected number of users able to continue using the system? Hint: Let X_i be a rv with the value 1 if user i can use the system and 0 otherwise.
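For part b), a simulation sketch (our own, with function names of our choosing) estimates the expected number of unchanged passwords under a uniform random permutation, which can be compared against the answer the hint's indicator-rv argument produces:

```python
import random

# Average number of fixed points of a uniform random permutation of n
# items, i.e., the number of users whose password is unchanged.
def avg_fixed_points(n: int, trials: int, seed: int = 3) -> float:
    rng = random.Random(seed)
    total = 0
    perm = list(range(n))
    for _ in range(trials):
        rng.shuffle(perm)
        total += sum(1 for i, p in enumerate(perm) if i == p)
    return total / trials

print(avg_fixed_points(10, 50_000))   # compare with your answer to b)
```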
Exercise 1.29. Suppose the rv X is continuous and has the distribution function F_X(x). Consider another rv Y = F_X(X). That is, for each sample point ω such that X(ω) = x, we have Y(ω) = F_X(x). Show that Y is uniformly distributed in the interval 0 to 1.
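The claim can be checked empirically; the sketch below (our own choice of a unit-rate exponential for X, so F_X(x) = 1 − e^{−x}) compares empirical quantiles of Y = F_X(X) with those of a uniform rv:

```python
import math
import random

# Draw X exponential with rate 1 and map it through its own distribution
# function to get Y = F_X(X) = 1 - exp(-X).
rng = random.Random(42)
ys = sorted(1 - math.exp(-rng.expovariate(1.0)) for _ in range(100_000))

# If Y is uniform on (0, 1), its empirical q-quantile should be near q.
for q in (0.1, 0.5, 0.9):
    print(q, ys[int(q * len(ys))])
```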
Exercise 1.30. Let Z be an integer-valued rv with the PMF p_Z(n) = 1/k for 0 ≤ n ≤ k − 1. Find the mean, variance, and moment generating function of Z. Hint: An elegant way to do this is to let U be a uniformly distributed continuous rv over (0, 1] that is independent of Z. Then U + Z is uniform over (0, k]. Use the known results about U and U + Z to find the mean, variance, and MGF for Z.
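As a brute-force check on the mean and variance this hint produces, one can compute both directly from the PMF and compare with the standard closed forms (k − 1)/2 and (k² − 1)/12 for a discrete uniform rv (the sketch and its function names are our own):

```python
from fractions import Fraction

# Exact mean and variance of Z uniform on {0, 1, ..., k-1}.
def mean_var(k: int):
    mean = Fraction(sum(range(k)), k)
    var = Fraction(sum(n * n for n in range(k)), k) - mean ** 2
    return mean, var

for k in (2, 6, 10):
    m, v = mean_var(k)
    # standard closed forms: mean (k-1)/2, variance (k^2 - 1)/12
    assert m == Fraction(k - 1, 2) and v == Fraction(k * k - 1, 12)
    print(k, m, v)
```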
Exercise 1.31. (Alternate approach 1 to the Markov inequality) a) Let Y be a nonnegative rv and y > 0 be some fixed number. Let A be the event that Y ≥ y. Show that y I_A ≤ Y (i.e., that this inequality is satisfied for every ω ∈ Ω).

b) Use your result in part a) to prove the Markov inequality.
Exercise 1.32. (Alternate approach 2 to the Markov inequality) a) Minimize E[Y] over all non-negative rv's such that Pr{Y ≥ b} ≥ u for some given b > 0 and 0 < u < 1. Hint: Use a graphical argument similar to that in Figure 1.7. What is the rv that achieves the minimum? Hint: It is binary.

b) Use part a) to prove the Markov inequality and also point out the distribution that meets the inequality with equality.
Exercise 1.33. The Borel–Cantelli lemma says that if {A_i; i ≥ 1} is a sequence of events and if Σ_{i=1}^∞ Pr{A_i} < ∞, then there is zero probability that an infinite number of those events occur. To state the lemma more cleanly, let I_{A_i} be the indicator rv of A_i. Thus Σ_i I_{A_i} is the number of events that occur. The Borel–Cantelli lemma then states that

    if Σ_{i=1}^∞ Pr{A_i} < ∞, then lim_{m→∞} Pr{Σ_{i=1}^∞ I_{A_i} > m} = 0.

a) Recalling that E[I_A] = Pr{A}, verify that

    Σ_{i=1}^∞ Pr{A_i} = Σ_{i=1}^∞ E[I_{A_i}] = E[Σ_{i=1}^∞ I_{A_i}].

b) Show that for each m > 0,

    Pr{Σ_{i=1}^∞ I_{A_i} ≥ m} ≤ E[Σ_{i=1}^∞ I_{A_i}] / m.

Taking the limit m → ∞ proves the lemma. Part b) shows slightly more than what the lemma states by showing that the probability of more than m occurrences goes to 0 as 1/m. See Exercise 5.8 for a different, perhaps less useful, way of expressing the event of infinitely many occurrences of the A_i.
Exercise 1.34. (The one-sided Chebyshev inequality) This inequality states that if a zero-mean rv X has a variance σ², then it satisfies the inequality

    Pr{X ≥ b} ≤ σ²/(σ² + b²)    for every b > 0,    (1.102)

with equality for some b only if X is binary and Pr{X = b} = σ²/(σ² + b²). We prove this here using the same approach as in Exercise 1.32. Let X be a zero-mean rv that satisfies Pr{X ≥ b} = u for some b > 0 and 0 < u < 1. The variance σ² of X can be expressed as

    σ² = ∫_{−∞}^{b} x² f_X(x) dx + ∫_{b}^{∞} x² f_X(x) dx.    (1.103)

a) Show that the second integral in (1.103) is at least b²u.

b) Show that the first integral in (1.103) is taken over a set of probability 1 − u and, using the fact that X is zero mean, show that ∫_{−∞}^{b} x f_X(x) dx ≤ −bu.

c) Minimize the first integral in (1.103) subject to the constraints in part b). Hint: If you scale f_X(x) up by 1/(1 − u), it integrates to 1 over (−∞, b) and the second constraint becomes an expectation. You can then minimize the first integral in (1.103) by inspection.

d) Combine the results in a) and c) to show that σ² ≥ b²u/(1 − u). Find the minimizing distribution. Hint: It is binary.

e) Use part d) to establish (1.102). Also show (trivially) that if Y has a mean Ȳ and variance σ², then Pr{Y − Ȳ ≥ b} ≤ σ²/(σ² + b²).
Exercise 1.35. (Proof of (1.51)) Here we show that if X is a zero-mean rv with a variance σ², then the median α satisfies |α| ≤ σ.

a) First show that |α| ≤ σ for the special case where X is binary with equiprobable values at ±σ.

b) For all zero-mean rv's X with variance σ² other than the special case in a), show that

    Pr{X ≥ σ} < 0.5.

Hint: Use the one-sided Chebyshev inequality of Exercise 1.34.

c) Show that Pr{X ≥ α} ≥ 0.5. Other than the special case in a), show that this implies that α < σ.

d) Other than the special case in a), show that |α| < σ. Hint: Repeat b) and c) for the rv −X. You have then shown that |α| ≤ σ with equality only for the binary case with values ±σ. For rv's Y with a non-zero mean, this shows that |α − Ȳ| ≤ σ.
Exercise 1.36. We stressed the importance of the mean of a rv X in terms of its association with the sample average via the WLLN. Here we show that in essence the WLLN allows us to evaluate the entire distribution function, say F_X(x), of X via sufficiently many independent sample values of X.

a) For any given y, let I_j(y) be the indicator function of the event {X_j ≤ y} where X_1, X_2, ..., X_j, ... are IID rv's with the distribution function F_X(x). State the WLLN for the IID rv's {I_1(y), I_2(y), ...}.

b) Does the answer to part a) require X to have a mean or variance?

c) Suggest a procedure for evaluating the median of X from the sample values of X_1, X_2, ... . Assume that X is a continuous rv. You need not be precise, but try to think the issue through carefully.

What you have seen here, without stating it precisely or proving it, is that the median has a law of large numbers associated with it, saying that the sample median of n IID samples of a rv is close to the true median with high probability.
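Part a)'s conclusion can be illustrated by simulation; in the sketch below (X uniform on (0, 1) is our own choice, so F_X(y) = y), the relative frequency of {X_j ≤ y} settles near F_X(y):

```python
import random

# Relative frequency of {X_j <= y} for IID samples; with X uniform
# on (0, 1), the distribution function is F_X(y) = y.
rng = random.Random(7)
samples = [rng.random() for _ in range(100_000)]
for y in (0.25, 0.5, 0.75):
    frac = sum(x <= y for x in samples) / len(samples)
    print(y, frac)
```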
Exercise 1.37. a) Show that for any 0 < k < n,

    C(n, k+1) ≤ C(n, k) (n − k)/k,

where C(n, k) denotes the binomial coefficient n!/(k!(n−k)!).

b) Extend part a) to show that, for all ℓ ≤ n − k,

    C(n, k+ℓ) ≤ C(n, k) [(n − k)/k]^ℓ.

c) Let p̃ = k/n and q̃ = 1 − p̃. Let S_n be the sum of n binary IID rv's with p_X(0) = q and p_X(1) = p. Show that for all ℓ ≤ n − k,

    p_{S_n}(k + ℓ) ≤ p_{S_n}(k) (q̃p/p̃q)^ℓ.

d) For k/n > p, show that Pr{S_n ≥ k} ≤ [p̃q/(p̃ − p)] p_{S_n}(k).

e) Now let ℓ be fixed and k = ⌈np̃⌉ for fixed p̃ such that 1 > p̃ > p. Argue that as n → ∞,

    p_{S_n}(k + ℓ) ~ p_{S_n}(k) (q̃p/p̃q)^ℓ    and    Pr{S_n ≥ k} ~ [p̃q/(p̃ − p)] p_{S_n}(k),

where a(n) ~ b(n) means that lim_{n→∞} a(n)/b(n) = 1.
Exercise 1.38. A sequence {a_n; n ≥ 1} of real numbers has the limit 0 if, for all ε > 0, there is an m(ε) such that |a_n| ≤ ε for all n ≥ m(ε). Show that the sequences in parts a) and b) below satisfy lim_{n→∞} a_n = 0 but the sequence in part c) does not have a limit.

a) a_n = 1/ln(ln(n + 1))

b) a_n = n^{10} exp(−n)

c) a_n = 1 for n = 10^ℓ for each positive integer ℓ and a_n = 0 otherwise.

d) Show that the definition can be changed (with no change in meaning) by replacing ε with either 1/k or 2^{−k} for every positive integer k.
Exercise 1.39. Consider the moment generating function of a rv X as consisting of the following two integrals:

    g_X(r) = ∫_{−∞}^{0} e^{rx} dF(x) + ∫_{0}^{∞} e^{rx} dF(x).

In each of the following parts, you are welcome to restrict X to be either discrete or continuous.

a) Show that the first integral always exists (i.e., is finite) for r ≥ 0 and that the second integral always exists for r ≤ 0.

b) Show that if the second integral exists for a given r_1 > 0, then it also exists for all r in the range 0 ≤ r ≤ r_1.

c) Show that if the first integral exists for a given r_2 < 0, then it also exists for all r in the range r_2 ≤ r ≤ 0.

d) Show that the range of r over which g_X(r) exists is an interval from some r_2 ≤ 0 to some r_1 ≥ 0 (the interval might or might not include each endpoint, and either or both endpoints might be 0 or infinite).

e) Find an example where r_1 = 1 and the MGF does not exist for r = 1. Find another example where r_1 = 1 and the MGF does exist for r = 1. Hint: Consider f_X(x) = e^{−x} for x ≥ 0 and figure out how to modify it to f_Y(y) so that ∫_0^∞ e^{y} f_Y(y) dy < ∞ but ∫_0^∞ e^{y+εy} f_Y(y) dy = ∞ for all ε > 0.
Exercise 1.40. Let {X_n; n ≥ 1} be a sequence of independent but not identically distributed rv's. We say that the weak law of large numbers (WLLN) holds for this sequence if for all ε > 0

    lim_{n→∞} Pr{ |S_n/n