Classical Biases in Common Statistics

Written 2001

Formatted 2009

Although statistics can be very valuable, we do have to be clear on what
they can tell us, and what they can't. Understanding where the biases
come from makes that distinction easier to see. As you look at these
examples, ask, "How can I be sure not to read in more than is actually
there?"

This page is laid out so that it can serve as an easy discussion guide
for a classroom lesson.

Example 1: Economic Comparisons

Look at the following two groups and ask yourself which group is better
off. Each group has four members.

Group A

dead

in jail

earns $30,000

earns $50,000

Group B

earns $10,000

earns $20,000

earns $30,000

earns $50,000

Which group did you choose? Let's look at what will be reported to us. For
Group A, the average income is $40,000, because the dead person and the
imprisoned person have been removed from the statistical process. For Group
B the average income is $27,500, about 2/3 the average income of Group A.

If you look only at the average incomes, which group do you consider better
off? Group A has the higher average. Is Group A really better off? Which
is better: poor or dead? Poor or in jail?
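For a classroom that wants to check the arithmetic, here is a minimal Python sketch of Example 1. The use of `None` to mark the dead and jailed members, and the helper names `reported_average` and `coverage`, are assumptions made for illustration; they are not part of any standard statistical procedure.

```python
from statistics import mean

# None marks a member with no reported income (dead or in jail).
# This encoding is an assumption made for the illustration.
group_a = [None, None, 30_000, 50_000]
group_b = [10_000, 20_000, 30_000, 50_000]

def reported_average(incomes):
    """Average over members who report an income; the rest are dropped."""
    earners = [x for x in incomes if x is not None]
    return mean(earners)

def coverage(incomes):
    """Fraction of the group that the reported average actually describes."""
    return sum(x is not None for x in incomes) / len(incomes)

avg_a = reported_average(group_a)
avg_b = reported_average(group_b)

print("Group A:", avg_a, "covering", coverage(group_a), "of the group")
print("Group B:", avg_b, "covering", coverage(group_b), "of the group")
```

Reporting the coverage next to the average shows that Group A's figure describes only half of its members.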

Example 2: More Economic Comparisons

How do these two groups compare?

Group A

earns $10,000

earns $10,000

earns $10,000

earns $1,000,000

Average = $260,000

Group B

earns $20,000

earns $30,000

earns $30,000

earns $40,000

Average = $30,000

Is Group A, with an average income of about $260,000, better off than Group B,
with an average income of $30,000? Group A's average income is more than eight
times Group B's. But three quarters of Group A earn less than every
member of Group B! Did the average really give you useful information? What
measure would have been more informative than average?
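One candidate answer is the median. The sketch below assumes Group A consists of the three $10,000 earners and the $1,000,000 earner, and Group B of the other four incomes, which is the split that matches the averages given in the example.

```python
from statistics import mean, median

# Assumed group membership, chosen to match the averages in the example.
group_a = [10_000, 10_000, 10_000, 1_000_000]
group_b = [20_000, 30_000, 30_000, 40_000]

mean_a, median_a = mean(group_a), median(group_a)
mean_b, median_b = mean(group_b), median(group_b)

# How many members of Group A earn less than the poorest member of Group B?
below_all_of_b = sum(x < min(group_b) for x in group_a)

print("Group A: mean", mean_a, "median", median_a)
print("Group B: mean", mean_b, "median", median_b)
print("Members of A earning less than everyone in B:", below_all_of_b)
```

The single large income moves Group A's mean a long way but leaves its median at $10,000, below Group B's median of $30,000.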

Example 3: Economic Combined Effects

What happens to the numbers when you combine the two effects above?

Group A

dead

in jail

earns $10,000

earns $1,000,000

Group B

earns $10,000

earns $20,000

earns $20,000

earns $50,000

The average incomes are Group A: $505,000, and Group B: $25,000. The averages
say that Group A earns about 20 times as much as Group B. Do you consider
this claim accurate, considering that 3/4 of the members of Group B are
better off than 3/4 of the members of Group A? What measure would have been
more informative than average?

Statisticians frequently talk about normalizing data, that is, correcting
for intrinsic errors. How would you normalize this data to account for
the dead and jailed persons being removed from the data set?
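As one possible starting point for that discussion, the sketch below keeps the dead and jailed members in the data set and counts each as an income of zero. That convention is purely an assumption for illustration; other choices are defensible, and the point is only to see how much the answer depends on the choice.

```python
from statistics import mean, median

# Assumption for illustration: dead and jailed members count as zero income
# instead of being dropped from the data set.
group_a_all = [0, 0, 10_000, 1_000_000]
group_a_reported = [10_000, 1_000_000]      # what the usual average sees
group_b = [10_000, 20_000, 20_000, 50_000]

reported_mean_a = mean(group_a_reported)
adjusted_mean_a = mean(group_a_all)
adjusted_median_a = median(group_a_all)
mean_b = mean(group_b)
median_b = median(group_b)

print("Group A reported mean:", reported_mean_a)
print("Group A adjusted mean:", adjusted_mean_a, "median:", adjusted_median_a)
print("Group B mean:", mean_b, "median:", median_b)
```

Under this assumption the adjusted mean for Group A is still far above Group B's, but the adjusted median falls to $5,000 against Group B's $20,000. Restoring the missing members is not enough on its own; the choice of measure matters as well.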

Example 4: Life expectancy - Remote Location

Imagine a 40-year-old pregnant woman, discouraged in life, retreating
to a remote desert and dying immediately after labor. Only two people,
mother and child, have settled in this place, so it is easy to calculate
the life expectancy: (40 + 0) / 2 = 20. If you go to that location, should
you expect to die when you are 20?

Example 5: Life Expectancy - Childhood Illness

Imagine a small town with a high infant mortality rate. The recorded
deaths have occurred at these ages: ten infants have died in their first
year, and four adults at the ages 60, 70, 75, and 80. This town's life
expectancy will be calculated as about 20. Should the young start worrying
as they approach the age of 20?

Do life expectancy numbers describe something that any individual within
that group should expect?
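The sketch below works through the Example 5 numbers in Python, using the four adult ages that are listed and recording each infant death as age 0. The second figure, the average age at death among those who survived their first year, is one hypothetical alternative to the overall average.

```python
from statistics import mean, median

# Ten infant deaths recorded as age 0, plus the four adult ages listed.
ages_at_death = [0] * 10 + [60, 70, 75, 80]

overall = mean(ages_at_death)

# Hypothetical alternative: average only over those who survived infancy.
survivors = [age for age in ages_at_death if age >= 1]
after_infancy = mean(survivors)

typical = median(ages_at_death)

print("Life expectancy at birth:", round(overall, 1))
print("Average age at death after surviving infancy:", after_infancy)
print("Median age at death:", typical)
```

Nobody in the town died anywhere near 20. The overall figure blends two very different groups, and splitting them shows that an adult could expect to live to about 71.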

Example 6: Generational Changes

How do these two generations compare?

Generation 1

Parent of 1 earning $60,000

Parent of 5 earning $20,000

Average income: $40,000

Generation 2

1 grown child earning $60,000

5 grown children each earning $20,000

Average income: $27,000

The average income dropped from $40,000 to about $27,000 from one generation
to the next, yet the offspring grew up to earn the same as their parents. Is
it correct to say that incomes dropped? Is it correct to say that incomes
stayed the same? How could this data be presented in a clearer way than
average income?
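One way to present the data more clearly is to compare each child with his or her own parent, not one generation's average with the other's. The sketch below is a minimal illustration; the tuple layout and variable names are assumptions chosen for the example.

```python
from statistics import mean

# Assumed layout: (parent_income, number_of_children, income_of_each_child)
families = [(60_000, 1, 60_000), (20_000, 5, 20_000)]

gen1 = [parent for parent, _, _ in families]
gen2 = [child for _, n, child in families for _ in range(n)]

avg_gen1 = mean(gen1)
avg_gen2 = mean(gen2)

# Income of each child relative to that child's own parent.
child_to_parent = [child / parent
                   for parent, n, child in families
                   for _ in range(n)]

print("Generation 1 average:", avg_gen1)
print("Generation 2 average:", round(avg_gen2))
print("Child income / parent income:", child_to_parent)
```

Every ratio is 1.0, so no individual earns less than his or her parent. The average fell only because the lower-income family contributed more members to the second generation.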

Comparing real data to our examples

In many locations and times in history life expectancy has been reported
to be less than 30. How should we interpret this? Did most people really
die when they were 30? How would this have affected families? How old
would most children have been when their parents died? Who would have
raised the children?

Many observers make powerful claims about the members of different groups
after average incomes are compared. Do averages really represent the individuals?
Imagine what the average income data will look like for any group that
Bill Gates is a member of. During the first decade of the 21st century,
the average income rose, but the median income stagnated. What did this
mean?

How would you generalize these examples to other statistical data sets?
What alternatives to averaging would be more informative? How would you
represent the difference between the lowest and the average? Or the highest
and the average? Do some biased data sets lend themselves well to normalizing?
If so, how would you do it?
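As a starting point for these questions, the sketch below reports several measures side by side. The `describe` helper and its field names are hypothetical, and the sample incomes are three earners at $10,000 and one at $1,000,000.

```python
from statistics import mean, median

def describe(incomes):
    """Report several measures together so no single number stands alone."""
    values = sorted(incomes)
    avg = mean(values)
    return {
        "lowest": values[0],
        "median": median(values),
        "mean": avg,
        "highest": values[-1],
        "mean_minus_lowest": avg - values[0],
        "highest_minus_mean": values[-1] - avg,
    }

summary = describe([10_000, 10_000, 10_000, 1_000_000])

for name, value in summary.items():
    print(name, value)
```

Seeing the lowest, median, mean, and highest together makes it obvious when one extreme value is carrying the average.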