Probability: Axioms and Fundaments

Previous chapters discussed how the mathematical theory of probability is connected to the world
through philosophical theories of probability,
and reviewed the basic tool needed to discuss probability
mathematically: Set Theory.
This chapter introduces the mathematical theory of probability,
in which probability is a function that assigns numbers between
0 and 100% to events, subsets of
outcome space.
Starting with just three axioms and a few definitions, the mathematical
theory develops powerful and beautiful consequences.
The chapter presents the
axioms of probability
and some consequences of the axioms.
Conditional probability
is then defined, which leads to two useful
formulae—the Multiplication Rule
and Bayes' Rule—and to the
definition of independence.
All these concepts and formulae play important roles in the sequel.

The Axioms of Probability

The axioms of probability are mathematical rules that probability must satisfy.
Let A and B be events.
Let P(A) denote the probability of the event A.
The axioms of probability are these three conditions on the function P:

The probability of every event is at least zero.
(For every event A, P(A) ≥ 0.
There is no such thing as a negative probability.)

The probability of the entire outcome space is 100%.
(P(S) = 100%.
The chance that something in the outcome space occurs is 100%,
because the outcome space contains every possible outcome.)

If two events are disjoint, the
probability that either of the events happens is the sum of the
probabilities that each happens.
(If AB = {},
P(A ∪ B) =
P(A) + P(B).)

Axiom 3 has a countable version, Axiom 3': if A1, A2, A3, … are pairwise
disjoint events, the probability of their union is the sum of their probabilities.
Both Axiom 3 and Axiom 3' hold for every probability function used in
this book.
Any function P that assigns numbers to subsets of the outcome space
S and satisfies the Axioms of Probability is called a
probability distribution on S.

Let S be a set containing
n>0 elements, for example,
S = {1, 2, … , n}.
For any subset A of S,
define #A
to be the number of elements of A.
For example, #{} = 0, #{1, 2} = 2, and
#{n, n−1, n−2} = 3.
The function # is called the cardinality function and
#A is called the
cardinality of A.

The cardinality of a finite set is the number of elements it contains, so
in this example, where S = {1, 2, 3, … , n},
#S = n.

Let P(A) = #A/n, the number of elements in the
subset A, divided by the total number of elements in S.
Then the function P is called
the uniform probability distribution on S.
The function P satisfies the axioms of probability.
Let us see why.

The number of elements in any subset A of
S is at least
zero (#A ≥ 0), so P(A) ≥ 0/n = 0.
Thus P satisfies Axiom 1.

P(S) =
#S/n = n/n = 100%.
Thus P satisfies Axiom 2.

If A and B are disjoint,
then the number of elements in the
union A∪B
is the number of elements in A plus the number
of elements in B:

#(A∪B)
= #A + #B.

Therefore,

P(A∪B) =
#(A∪B)/n =
(#A + #B)/n = #A/n +
#B/n = P(A)
+ P(B).

Thus P satisfies Axiom 3.
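The verification can be made concrete with a short Python sketch (the choice n = 6 is illustrative, not from the text) that checks all three axioms for the uniform distribution by brute force over every pair of subsets:

```python
from itertools import combinations

# Illustrative outcome space S = {1, 2, ..., 6}; any finite n > 0 works.
n = 6
S = frozenset(range(1, n + 1))

def P(A):
    """The uniform probability distribution on S: #A / n."""
    return len(A) / n

def subsets(s):
    """Every subset of s, as frozensets."""
    items = list(s)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]

events = subsets(S)

# Axiom 1: every event has probability at least zero.
assert all(P(A) >= 0 for A in events)

# Axiom 2: the probability of the whole outcome space is 100%.
assert P(S) == 1.0

# Axiom 3: for disjoint events, P(A ∪ B) = P(A) + P(B).
for A in events:
    for B in events:
        if not (A & B):  # AB = {}
            assert abs(P(A | B) - (P(A) + P(B))) < 1e-12
```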

We shall use the uniform probability distribution very often.
For example, we shall use the uniform probability distribution on
the outcome space S = {0, 1} to model the number of heads in a
single toss of a fair coin.
We shall use the uniform probability distribution on the outcome
space S = {1, 2, … , 6} to model the number of spots that
show on the top face of a fair die when it is rolled.
We shall use the uniform probability distribution on the outcome
space S of the 36 pairs

{(i, j): i = 1, 2, … , 6 and j = 1, 2, … , 6}

to model rolls of a fair pair of dice.
We shall use the uniform probability distribution on the outcome
space S of all 52! permutations of a deck of cards to model
shuffling the deck well.
We shall use the uniform probability distribution to model
drawing a ticket from a well-stirred box of numbered tickets;
in that case, the outcome space S is the collection
of numbers written on the tickets (including duplicates
as often as they occur on the tickets).
The uniform probability distribution is the same as the
distribution postulated by the
Theory of Equally Likely Outcomes
(if the outcomes are defined suitably).

Consider a random trial that can result in failure or success.
Let 0 stand for failure, and let 1 stand for success.
Then we can consider the outcome space to be S = {0, 1}.
For any number p between 0 and 100%, define the
function P as follows:

P({1}) = p,

P({0}) = 100% − p,

P(S) = 100%,

P({}) = 0.

Then P is a probability distribution on
S, as we can verify
by checking that it satisfies the axioms:

Because p is between 0 and 100%, so is
100% − p.
The outcome space S has but four subsets:
{}, {0}, {1}, and {0, 1}.
The values assigned to them by P are
0, 100% − p, p, and 100%,
respectively.
All these numbers are zero or larger, so P satisfies Axiom 1.

By definition, P(S) = 100%, so
P satisfies Axiom 2.

The empty set and any other set are disjoint, and it is easy to
see that

P({}∪A) =
P({}) + P(A) for any subset A of S.

The only other pair of disjoint events in S
is {0} and {1}.
We can calculate

P({0}∪{1}) = P(S)
= 100% = (100% − p) + p = P({0}) + P({1}).

Thus P satisfies Axiom 3.
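The same verification can be written as a minimal Python check; the value p = 0.25 is illustrative (chosen so the arithmetic is exact in floating point):

```python
# The two-outcome distribution on S = {0, 1}; p = 0.25 is illustrative.
p = 0.25

# P assigns a number to each of the four subsets of S.
P = {
    frozenset():       0.0,      # the empty set
    frozenset({0}):    1 - p,    # failure
    frozenset({1}):    p,        # success
    frozenset({0, 1}): 1.0,      # the whole outcome space S
}

# Axiom 1: every probability is zero or larger.
assert all(v >= 0 for v in P.values())
# Axiom 2: P(S) = 100%.
assert P[frozenset({0, 1})] == 1.0
# Axiom 3: the only nontrivial disjoint pair of events is {0} and {1}.
assert P[frozenset({0})] + P[frozenset({1})] == P[frozenset({0, 1})]
```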

In later chapters this probability distribution will be the building
block for more complex distributions involving sequences of trials.

Consequences of the Axioms of Probability

Everything that is mathematically true of probability is a consequence
of the Axioms of Probability, and of further definitions.
For example, if S is
countable—that is,
if its elements can be put into 1:1 correspondence with a subset of the
integers—the sum of the probabilities of the elements of
S must be 100%.
This follows from Axioms 2 and 3': Axiom 3' tells us that because the
elements of S partition S, the
probability of S is the sum of the probabilities
of the elements of S. Axiom 2 tells us that that sum must be 100%.
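For instance, for the uniform distribution on a fair die, the six one-outcome events partition S, so their probabilities must sum to exactly 100%. A quick check in Python, using exact rational arithmetic to avoid rounding:

```python
from fractions import Fraction

# Uniform distribution on S = {1, ..., 6}: each element has probability 1/6.
element_probs = [Fraction(1, 6) for _ in range(1, 7)]

# The one-element events partition S, so the probabilities must sum to 100%.
assert sum(element_probs) == 1
```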

The Complement Rule

Another consequence of the axioms is the Complement Rule: The probability
that an event occurs is always equal to 100% minus the probability
that the event does not occur:

P(Ac) = 100% − P(A).

The Complement Rule is extremely useful, because in many problems
it is much easier to calculate the probability that A does not
occur than to calculate the probability that A does occur.
The complement rule can be derived from the axioms: the union of A
and its complement Ac is
S
(either A happens or it does not,
and there is no other possibility), so

P(A∪Ac)
= P(S) = 100%,

by axiom 2.
The event A and its complement are disjoint (if
"A does not happen" happens,
A does not happen;
if A happens, "A
does not happen" does not happen), so

P(A∪Ac) =
P(A) + P(Ac)

by axiom 3.
Putting these together, we get

P(A) + P(Ac) = 100%.

Subtracting P(A) from both sides of this equation yields
what we sought:

P(Ac) = 100% − P(A).

Consider tossing a fair coin 10 times in such a manner that every
sequence of 10 heads and/or tails is equally likely.
What is the probability that the coin lands heads at least once?

This would be quite difficult to calculate directly, because there
are very many ways in which the coin can land heads at least once.
However, there is only one way the coin can fail to land heads at least once:
All the tosses must yield tails.
That makes it easy to calculate the probability that the coin lands
heads at least once, using the Complement Rule.

Every sequence of heads and tails is equally likely, by assumption:
The probability distribution is the uniform distribution on sequences
of 10 heads and/or tails, so the probability of any particular
sequence is 100%/(total number of sequences).
By the Fundamental Rule of Counting,
there are

2×2× … ×2 = 2^10 = 1,024

sequences of 10 heads and tails.

One of those sequences is (tails, tails, … , tails), so the probability that
the coin lands tails in all 10 tosses is

100%/2^10 = 0.0977%.

By the complement rule, the probability that the coin lands heads at
least once is therefore

100% − 0.0977% = 99.902%.
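The computation above takes only a few lines of Python:

```python
# Complement Rule for "at least one head" in 10 tosses of a fair coin.
n_sequences = 2 ** 10                  # Fundamental Rule of Counting: 1,024
p_all_tails = 1 / n_sequences          # exactly one all-tails sequence
p_at_least_one_head = 1 - p_all_tails  # Complement Rule

print(round(100 * p_all_tails, 4))          # 0.0977 (percent)
print(round(100 * p_at_least_one_head, 3))  # 99.902 (percent)
```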

A special case of the Complement Rule is that the probability of the empty set
is always zero (P({}) = 0%), because
P(S) = 100%, and
Sc = {}.

An event A whose probability is 100% is said to be
certain or sure.
S is certain.

The Probability of the Union of Two Events

The third Axiom of Probability tells us how to find the probability of a
union of disjoint events in terms of their individual probabilities.
The Axioms can be used together to find a formula for the probability
of a union of two events that are not necessarily disjoint in
terms of the probability of each of the events and the probability of
their intersection.

Probability is analogous to area or volume or mass.
Consider the unit square, each of whose sides has length 1.
Its total area is 1×1 = 1 = 100%.
Let's call the square S, just like outcome space.
Now consider regions inside the square S
(subsets
of S).
The area of any such region is at least zero, the area of S is
100%, and the area of the union of two regions is the sum of
their areas, if they do not overlap (i.e., if they are
disjoint).
These facts are direct analogues of the axioms of probability,
and we shall often use this model to get intuition about probability.

It might help your intuition to consider the square S to be a
dartboard.
The experiment consists of throwing a dart at the board once.
The event A occurs if the dart sticks in the set A.
The event AB occurs if the dart sticks in both
A and B on that one toss.
Clearly, AB cannot occur unless A and
B overlap—the
dart cannot stick in two places at once.
A∪B occurs if the dart
sticks in either A or B (or both) on that
one throw.
A and B need not overlap for
A∪B to occur.

This analogy is also useful for thinking about the connection between
Set Theory and logical implication.
If A is a subset of B, the
occurrence of A implies
the occurrence of B; we shall sometimes say that A
implies B.
In the dartboard model, the dart cannot stick in A
without sticking in B as well, so if
A occurs, B must occur also.
If A implies B, AB=A,
so P(AB)=P(A).
If AB = {},
A implies Bc and
B implies Ac: If the dart sticks
in A it did not stick in B, and vice versa.
If A implies B,
then if B does not occur A cannot
occur either:
Bc implies Ac,
so Bc is a subset of Ac.

The following exercises test your understanding of the axioms of
probability and their consequences.


Conditioning

In probability, conditioning means incorporating new
restrictions on the outcome of an experiment:
updating probabilities to take into account new information.
This section describes conditioning, and how conditional
probability can be used to solve complicated problems.

Conditional Probability

The conditional probability of A given B,
P(A | B),
is the probability of the
event A, updated on the basis of the knowledge that the event B
occurred.
Suppose that AB = {}
(A and B are disjoint).
Then if we learn that B occurred, we know A did not occur,
so we should revise the probability of A to be zero
(the conditional probability of A given B is zero).
On the other hand, suppose that AB = B
(B is a subset of A, so B implies A).
Then if we learn that B occurred, we know A must have
occurred as well, so we should revise the probability of
A to be 100% (the conditional probability of A given B is 100%).
For in-between cases, the conditional probability of A given B
is defined to be

P(A | B) = P(AB)/P(B),

provided P(B) is not zero (division by zero is undefined).
"P(A | B)" is pronounced "the
(conditional) probability of A given B."

Why does this formula make sense?
First of all, note that it does agree with the intuitive answers we
found above: if AB = {}, then
P(AB) = 0, so

P(A | B)
= 0/P(B) = 0;

and if AB = B,

P(A | B) = P(B)/P(B) = 100%.

Similarly, if we learned that S occurred, this is not really new
information (by definition, S always occurs, because it
contains all possible outcomes), so we would like
P(A | S) to equal P(A).
That is how it works out:
AS = A, so

P(A | S) = P(A)/P(S) =
P(A)/100% = P(A).

Now suppose that A and B are not disjoint.
Then if we learn that B occurred, we can restrict attention
to just those outcomes that are in B, and disregard the
rest of S, so we have a new outcome space that is just B.
We need P(B) = 100% to consider B an outcome space; we
can make this happen by dividing all probabilities by P(B).
For A to have occurred in addition to
B requires that AB occurred,
so the conditional probability of A given B is
P(AB)/P(B),
just as we defined it above.

We shall deal two cards from a well-shuffled deck.
What is the conditional probability that the second card is an Ace
(event A), given that the first card is an Ace (event B)?

Solution.
By definition, this is P(AB)/P(B).
The (unconditional) chance that the first card is an Ace
is 100%/13 = 7.7%, because there are 13 possible
faces for the first card, and all are equally likely
(this is what we mean by a well-shuffled deck).

The chance that both cards are Aces can be computed as follows:
From the four Aces (one for each suit), we need to pick two; there are 4C2 = 6 ways
that can happen.
The total number of ways of picking two cards from the deck is
52C2 = 52×51/2 = 1326, so the chance that the two cards are
both Aces is (6/1326)×100% = 0.5%.
The conditional probability that the second card is an
Ace given that the first card is an Ace is thus
0.5%/7.7% = 5.9%.
As we might expect, it is somewhat lower than the chance that the
first card is an Ace, because we know one of the Aces is gone.

We could approach this more intuitively as well: Given that
the first card is an Ace, the second card is an Ace too if it is one of
the three remaining Aces among the 51 remaining cards.
These possibilities are equally likely if the deck was shuffled well,
so the chance is 3/51 × 100% = 5.9%.
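Both the definitional and the intuitive calculations can be confirmed by enumerating every ordered pair of cards in Python (the face numbers and suit letters are illustrative encodings):

```python
# Enumerate ordered pairs of distinct cards from a 52-card deck.
faces = range(1, 14)            # 1 encodes Ace
suits = ('S', 'H', 'D', 'C')
deck = [(f, s) for f in faces for s in suits]

pairs = [(c1, c2) for c1 in deck for c2 in deck if c1 != c2]
first_ace = [p for p in pairs if p[0][0] == 1]        # event B
both_aces = [p for p in first_ace if p[1][0] == 1]    # event AB

p_b = len(first_ace) / len(pairs)     # P(B) = 4/52, about 7.7%
p_ab = len(both_aces) / len(pairs)    # P(AB) = (4*3)/(52*51), about 0.45%
p_a_given_b = p_ab / p_b              # P(A | B)

assert abs(p_a_given_b - 3 / 51) < 1e-12   # agrees with the intuitive 3/51
```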

Conditional probability behaves just like probability:
It satisfies the axioms of probability and all their consequences.
Thus, for example, the Complement Rule holds for conditional probabilities:
P(Ac | B) = 100% − P(A | B).

Independence

Two events are independent if learning that one occurred gives us
no information about whether the other occurred.
That is, A and B are independent
if P(A | B) = P(A) and
P(B | A) = P(B).
A slightly more general way to write this is that
A and B are independent
if P(AB) = P(A)×P(B).
(This covers the cases that P(A),
P(B) or both are equal to zero,
while the definition of independence in terms of conditional probability requires the
probability in the denominator to be different from zero.)
To reiterate: Two events are independent if and only if the probability
that both events happen simultaneously is the product of their
unconditional probabilities.
If two events are not independent, they are dependent.

Independence and Mutual Exclusivity Are Different!
In fact, the only way two events can be both
mutually exclusive and
independent is if at least one of them has probability equal to zero.
If A and B are mutually exclusive,
learning that B happened
tells us that A did not happen.
This is clearly informative: The conditional probability of A given
B is zero!
This changes the (conditional) probability of A unless its
(unconditional) probability was zero.

Independent events bear a special relationship to each other.
Independence is a very precise point between being disjoint
(so that the occurrence of one event implies that the other did not occur),
and one event being a subset of the
other (so that the occurrence of one event implies the occurrence of the other).
Here is a summary of the contrast between independent events and
mutually exclusive events:

If two events are mutually exclusive, they cannot both occur
in the same trial: The probability of their intersection is zero.
The probability of their union is the sum of their probabilities.

If two events are independent, both can occur in the same trial
(except possibly if at least one of them has probability zero).
The probability of their intersection is the product of their
probabilities.
The probability of their union is less than the sum of their
probabilities, unless at least one of the events has
probability zero.

Consider a Venn diagram that represents two events,
A and B, as subsets of a rectangle S.
The probabilities of the events are proportional to their areas.
Initially, the probability of A is 30% and the probability
of B is 20%.
The figure also shows the probability of AB and of
A∪B.
To make A and B independent, the overlap must be arranged so that
the area of their intersection equals the product of their areas,
so that P(AB) = P(A)×P(B) = 30%×20% = 6%.
It is hard to get just the right amount of overlap:
Independence is a very special relationship between events.

If A and B are independent, so are

A and Bc

Ac and Bc

Ac and B.
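These three facts follow algebraically from P(AB) = P(A)×P(B); a numeric sketch with illustrative probabilities:

```python
# If A and B are independent with these (illustrative) probabilities...
pa, pb = 0.3, 0.2
p_ab = pa * pb   # independence: P(AB) = P(A)P(B)

# ...then A and Bc are independent: P(A Bc) = P(A) - P(AB) = P(A)(1 - P(B)).
assert abs((pa - p_ab) - pa * (1 - pb)) < 1e-12
# Ac and Bc are independent: P(Ac Bc) = 1 - P(A ∪ B) = (1 - P(A))(1 - P(B)).
assert abs((1 - (pa + pb - p_ab)) - (1 - pa) * (1 - pb)) < 1e-12
# Ac and B are independent: P(Ac B) = P(B) - P(AB) = (1 - P(A))P(B).
assert abs((pb - p_ab) - (1 - pa) * pb) < 1e-12
```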

What kinds of events are (generally assumed to be) independent?
The outcomes of successive tosses
of a fair coin, the outcomes of random draws from a box with replacement, etc.
Draws without replacement are
dependent,
because what can happen on a given
draw depends on what happens on previous draws.
The next two examples illustrate the contrast between independent and dependent events.

Suppose I have a box with four tickets in it, labeled
1, 2, 3, and 4.
I stir the tickets and then draw one from the box,
stir the remaining tickets again without returning the ticket I drew the first
time, and draw another ticket.
Consider the event A = {I get the ticket labeled 1 on the first draw} and the
event B = {I get the ticket labeled 2 on the second draw}.
Are A and B dependent or independent?

Solution:
The chance that I get the 1 on the first draw is 25%.
The chance that I get the 2 on the second draw is 25%.
The chance that I get the 2 on the second draw given that
I get the 1 on the first draw is 33%, which is much larger
than the unconditional chance that I draw the 2 the second time.
Thus A and B are dependent.

Now suppose that I replace the ticket I got on the first
draw and stir the tickets again before drawing the second time.
Then the chance that I get the 1 on the first draw is 25%,
the chance that I get the 2 on the second draw is 25%, and
the conditional chance that I get the 2 on the second draw
given that I drew the 1 the first time is also 25%.
A and B are thus independent if I draw with replacement.
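Both cases can be checked by enumeration in Python:

```python
from itertools import permutations, product

tickets = (1, 2, 3, 4)

def probs(draws):
    """P(B) and P(B | A) for A = {1 on draw 1} and B = {2 on draw 2}."""
    p_b = sum(1 for d in draws if d[1] == 2) / len(draws)
    a = [d for d in draws if d[0] == 1]
    p_b_given_a = sum(1 for d in a if d[1] == 2) / len(a)
    return p_b, p_b_given_a

# Without replacement: ordered pairs of distinct tickets, equally likely.
without = probs(list(permutations(tickets, 2)))
print(without)    # (0.25, 0.333...): dependent

# With replacement: ordered pairs, repeats allowed.
with_repl = probs(list(product(tickets, repeat=2)))
print(with_repl)  # (0.25, 0.25): independent
```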

Two fair dice are rolled independently; one is blue, the other is red.
What is the chance that the number of spots that show on the red die is
less than the number of spots that show on the blue die?

Solution: The event that the number of spots that
show on the red die is less than the number that show on the blue
die can be broken up into mutually exclusive events,
according to the number of spots that show on the blue die.
The chance that the number of spots that show on the red die is
less than the number that show on the blue die is the sum of the
chances of those simpler events.
If only one spot shows on the blue die, the number that shows on the
red die cannot be smaller, so the probability is zero.
If two spots show on the blue die, the number that shows
on the red die is smaller if the red die shows exactly one spot.
Because the numbers of spots that show on the blue and red dice are
independent, the chance that the blue die shows two spots and the
red die shows one spot is (1/6)(1/6) = 1/36.
If three spots show on the blue die, the number that shows on the red
die is smaller if the red die shows one or two spots.
The chance that the blue die shows three spots and the red die
shows one or two spots is (1/6)(2/6) = 2/36.
If four spots show on the blue die, the number that show on the
red die is smaller if the red die shows one, two, or three spots;
the chance that the blue die shows four spots and the red die
shows one, two, or three spots is (1/6)(3/6) = 3/36.

Proceeding similarly for the cases that the blue die shows five or
six spots (which contribute 4/36 and 5/36, respectively) gives the ultimate result:
0/36 + 1/36 + 2/36 + 3/36 + 4/36 + 5/36 = 15/36.

Alternatively, one could just count the ways:
There are 36 possibilities, which can be written in a square table as follows.

The 36 possible outcomes of rolling two dice (each entry is red, blue):

               Blue Die
            1    2    3    4    5    6
  Red   1  1,1  1,2  1,3  1,4  1,5  1,6
  Die   2  2,1  2,2  2,3  2,4  2,5  2,6
        3  3,1  3,2  3,3  3,4  3,5  3,6
        4  4,1  4,2  4,3  4,4  4,5  4,6
        5  5,1  5,2  5,3  5,4  5,5  5,6
        6  6,1  6,2  6,3  6,4  6,5  6,6

The outcomes above the diagonal comprise the event whose probability we seek.
There are 36 outcomes in all, of which 6 are on the diagonal.
Half of the remaining 36 − 6 = 30 are above the diagonal; half of 30 is 15.
The 36 outcomes are equally likely, so the chance is 15/36.
The outcomes (1,4), (2,4), and (3,4) comprise one of the
mutually exclusive pieces used in the computation above:
the three ways the red die can show a smaller number of spots than the blue die
when the blue die shows exactly four spots.
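The counting argument is easy to verify in Python:

```python
from itertools import product

# All 36 equally likely (red, blue) outcomes of rolling the two dice.
outcomes = list(product(range(1, 7), repeat=2))
favorable = [(r, b) for r, b in outcomes if r < b]  # red shows fewer spots

print(len(favorable), len(outcomes))   # 15 36
print(len(favorable) / len(outcomes))  # 0.41666... = 15/36
```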

Rearranging the definition of conditional probability gives
P(AB) = P(A | B)×P(B); this is called the Multiplication Rule.
The following two examples illustrate the Multiplication Rule.

A deck of cards is shuffled well, then two cards are drawn.
What is the chance that both cards are aces?

Solution: Apply the Multiplication Rule.

P(card 1 is an Ace and card 2 is an Ace) =
P(card 2 is an Ace | card 1 is an Ace)×P(card 1 is an Ace)

= 3/51 × 4/52 = 0.5%.

You can see that the Multiplication Rule can save you a lot of time!

Suppose there is a 50% chance that you catch the 8:00am bus.
If you catch the bus, you will be on time.
If you miss the bus, there is a 70% chance that you will be late.
What is the chance that you will be late?
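Everything needed is in the statement of the problem: partition the event "late" according to whether you catch the bus, and apply the Multiplication Rule to each piece. A sketch of the computation (the 35% answer follows directly from the numbers given):

```python
p_catch = 0.5             # chance of catching the 8:00am bus
p_late_given_catch = 0.0  # if you catch the bus, you are on time
p_late_given_miss = 0.7   # chance of being late if you miss the bus

# Partition "late" into (late and catch) and (late and miss); the pieces
# are mutually exclusive, so their probabilities add. The probability of
# each piece comes from the Multiplication Rule.
p_late = (p_late_given_catch * p_catch
          + p_late_given_miss * (1 - p_catch))
print(p_late)   # 0.35
```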

Bayes' Rule

Bayes' Rule says that

P(A | B) = P(B | A)×P(A)/( P(B | A)×P(A) + P(B | Ac)×P(Ac) ).

The numerator on the right is P(AB), computed using the
Multiplication Rule.
The denominator is just P(B), computed by
partitioningB
into the mutually exclusive sets AB and AcB,
and finding the probability of each of those pieces using the
Multiplication Rule.

Bayes' Rule is useful to find the conditional probability of
A given B in terms of the conditional probability
of B given A,
which is the more natural quantity to measure in some
problems, and the easier quantity to compute in some
problems. For example, in screening for a disease, the
natural way to calibrate a test is to see how well it does at
detecting the disease when the disease is present, and to see
how often it raises false alarms when the disease is not
present.
These are, respectively, the conditional probability of
detecting the disease given that the disease is present,
and the conditional probability of incorrectly raising an alarm
given that the disease is not present.
However, the interesting quantity for an individual is the
conditional chance that he or she has the disease,
given that the test raised an alarm.
An example will help.

Suppose that 10% of a given population has benign chronic flatulence.
Suppose that there is a standard screening test for benign chronic
flatulence that has a 90% chance of correctly detecting that one has
the disease, and a 10% chance of a false positive
(erroneously reporting that one has the disease when one does not).
We pick a person at random from the population (so that everyone has
the same chance of being picked) and test him/her.
The test is positive. What is the chance that the person has the disease?

Solution:
We shall combine several things we have learned.
Let D be the event that the person has the disease, and let
T be
the event that the person tests positive for the disease.
The problem statement told us that:

P(D) = 10%.

P(T | D) = 90%.

P(T | Dc) = 10%.

The problem asks us to find P(D | T) = P(DT)/P(T).
We shall find P(T)
by partitioning T into two mutually exclusive
pieces, DT and DcT, corresponding to
testing positive and having the disease (DT) and testing positive falsely
(DcT).
Then P(T) is the sum of P(DT)
and P(DcT).
We will find those two probabilities using the Multiplication Rule.
We need P(DT) for the numerator, and it will be one of the
terms in the denominator as well.
The probability of DT is, by the Multiplication Rule,

P(DT) = P(T | D) × P(D) = 90% × 10% = 9%.

The probability of DcT is, by the Multiplication Rule and the Complement Rule,

P(DcT) = P(T | Dc) × P(Dc) = 10% × (100% − 10%) = 9%.

Thus

P(T) = P(DT) + P(DcT) = 9% + 9% = 18%,

because DT and DcT are mutually exclusive.
Finally, plugging in the definition of P(D | T) gives:

P(D | T) = P(DT)/P(T) = 9%/18% = 50%.

Because only a small fraction of the population actually have benign chronic
flatulence, the chance that a positive test result for someone
selected at random from the population is a false positive is
50%, even though the test is 90% accurate.
The computation we just made is equivalent to using Bayes' rule:

P(D | T) =
P(T | D)×P(D)/(P(T | D)×P(D) +
P(T | Dc)×P(Dc) )

= 90%×10%/( 90%×10% + 10%×90%)

= 50%.
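The whole screening computation fits in a few lines of Python:

```python
# Screening example: prior 10%, sensitivity 90%, false-positive rate 10%.
p_d = 0.10           # P(D): base rate of benign chronic flatulence
p_t_given_d = 0.90   # P(T | D): chance the test detects the disease
p_t_given_dc = 0.10  # P(T | Dc): chance of a false alarm

p_dt = p_t_given_d * p_d          # P(DT), by the Multiplication Rule
p_dct = p_t_given_dc * (1 - p_d)  # P(DcT), Multiplication + Complement Rules
p_t = p_dt + p_dct                # P(T): DT and DcT partition T

print(p_dt / p_t)   # P(D | T) = 0.5
```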

The Base Rate Fallacy
consists of ignoring P(A)
or P(B)
in computing P(B | A) from P(A | B) and
P(A | Bc).
For instance, in the example above, the base rate for benign chronic flatulence is 10%.
The test is 90% accurate (both for false positives and for false negatives).
The base rate fallacy is to conclude that since the test is 90% accurate, it must be true that 90% of people
who test positive in fact have the disease—ignoring the base rate of the disease in the population
and the frequency of false positive test results.
We just saw that that conclusion is wrong: if people are tested at random, of those who test positive,
only 50% have the disease, on average.

The Prosecutor's Fallacy
consists of confusing
P(B | A) with P(A | B).
For instance, P(A | B) might be the probability of some evidence
given that the accused is guilty, while P(B | A) is the probability that
the accused is guilty given the evidence.
The second "conditional probability" generally does not make sense at all;
even when it does,
its numerical value need not be close to the value of P(A | B).

The following exercises check your ability to work with conditional
probability, the Multiplication Rule, and Bayes' Rule.



Summary

The Axioms of Probability are mathematical rules that
must be followed in assigning probabilities to events:
The probability of an event cannot be negative, the probability that
something happens must be 100%, and if two events cannot both occur,
the probability that either occurs is the sum of the probabilities
that each occurs.
A function that assigns numbers to events and satisfies the axioms
is called a probability distribution.

The axioms have numerous consequences, including the following:
The probability of the empty set is zero.
The probability that a given event does not occur is 100% minus the
probability that the event occurs.
The probability that either of two events occurs is the sum of the
probabilities that each occurs, minus the probability that both occur.
The probability that either of two events occurs is at least as large
as the probability that each occurs, and no larger than the sum of
the probabilities that each occurs.
The probability that two events both occur is no larger than either
of their individual probabilities.

Conditioning describes updating probabilities to incorporate new knowledge.
For example, how should we update the probability of the event A
if we learn that the event B occurs?
The updated probability is the conditional probability of A given B,
which is equal to the probability
that A and B
both occur, divided by the probability that B occurs,
provided that the probability that B occurs is not zero.
Conditional probability satisfies the axioms of probability.

Rearranging the definition of conditional probability yields the
Multiplication Rule:
The probability that A and B both
occur is the conditional probability of A given B, times the
probability that B occurs.
Two events are independent if the occurrence of one is
uninformative with respect to the occurrence of the other:
if P(A | B) = P(A).
A slightly more general definition is that A and
B are independent
if P(AB) = P(A)×P(B).

Bayes' Rule expresses P(A | B) in terms
of P(B | A),
P(B | Ac), and
P(A),
which in some problems are easier to calculate than P(A | B).
Bayes' Rule says that

P(A | B) = P(B | A)×P(A)/( P(B | A)×P(A) + P(B | Ac)×P(Ac) ).