Three types of information help adequately describe
a distribution: its shape, its central tendencies,
and how it is dispersed. This lesson deals
primarily with measures of central tendency
and measures of dispersion.

Average most often refers to the
arithmetic mean, but is actually ambiguous
and may also refer to the mode, median, or midrange.

You should always clarify which average is being used, preferably by
using a more specific term. Averages give us information about a typical
element of a data set. They are measures of central tendency.

Mean most often refers to the
arithmetic mean, but is also ambiguous.
Unless specified otherwise, we will assume arithmetic mean whenever
the term mean is used.

The
Arithmetic Mean is
obtained by summing all elements of the data set
and dividing by the number of elements.

Other means, such as geometric, harmonic, quadratic, trimmed,
and weighted will not be discussed here but can be found
in statistics intro lesson 4.

Symbolically, the arithmetic mean is expressed as
$\bar{x} = \frac{\sum x_i}{n}$
where
$\bar{x}$ (pronounced "x-bar") is the arithmetic mean for a sample and
$\Sigma$ is the capital Greek letter sigma and
indicates summation.
$x_i$ refers to each element of the data set as i ranges
from 1 to n. n is the number of elements in the data set.
The equation is essentially the same for finding a population mean;
however, the symbol for the population mean is the
small Greek letter µ (mu).
Roman letters usually represent sample statistics,
whereas Greek letters usually represent population parameters.

When arithmetic means are combined for different groups,
we must take into account the possibly disparate number
of data elements in each group and weight the means accordingly.

Example: Suppose there are 10 freshmen boys and 20 freshmen girls.
Suppose further that the boys' test average was 72.5 and
the girls' test average was 73.7. Find the overall average
(arithmetic mean).
Solution: (10×72.5+20×73.7)/30=73.3.
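
The weighting in this solution can be sketched in a few lines of Python. The function name combined_mean is only an illustrative choice, not something from the lesson; the group sizes and means are the ones given in the example.

```python
def combined_mean(group_sizes, group_means):
    """Weight each group's mean by its size, then divide by the total size."""
    weighted_total = sum(n * m for n, m in zip(group_sizes, group_means))
    return weighted_total / sum(group_sizes)

# 10 boys averaging 72.5 and 20 girls averaging 73.7
overall = combined_mean([10, 20], [72.5, 73.7])
print(round(overall, 1))  # 73.3

# Simply averaging the two means ignores the group sizes and gives 73.1
unweighted = (72.5 + 73.7) / 2
print(round(unweighted, 1))  # 73.1
```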

The arithmetic mean has two important properties which
make it the most frequently used measure of central tendency:
1) the sum of all deviations from the mean is zero; and
2) the sum of the squares of the deviations from
the mean is minimized.
Deviation here refers to the directed distance
(i.e. plus or minus sign included)
a given score is from the mean.
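
Both properties are easy to check numerically. The following is only a small sketch: the data set is an arbitrary illustrative choice, and the helper name sum_squared_deviations is hypothetical.

```python
data = [1, 7, 8, 9, 10]          # illustrative data set
mean = sum(data) / len(data)     # 7.0

# Property 1: the signed deviations from the mean add up to zero.
deviations = [x - mean for x in data]
sum_of_deviations = sum(deviations)

def sum_squared_deviations(center):
    """Sum of squared distances of every element from a chosen center."""
    return sum((x - center) ** 2 for x in data)

# Property 2: any center other than the mean gives a larger sum of squares.
at_mean = sum_squared_deviations(mean)
nearby = [sum_squared_deviations(mean + shift) for shift in (-1.0, -0.1, 0.1, 1.0)]

print(sum_of_deviations)  # 0.0
print(at_mean)            # 50.0
print(all(value > at_mean for value in nearby))  # True
```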

Sample Size is the number of elements in a sample. It is referred
to by the symbol n.

Be sure to use a lower case n for sample size.
An upper case N refers to Population Size,
unless being used in the context of a
normally distributed population.

The Mode is the data element which occurs most frequently.
A useful mnemonic is to alliterate the words mode and most.
Alliterations start with the same sound, like
"seven slippery slimy snakes...".

A data set with only one mode is termed unimodal.
Some data sets contain no repeated elements.
In this case, there is no mode (or the mode is the empty set).
It is also possible for two or more [nonadjacent]
elements to be repeated with the same frequency.
In these cases, there are two or more modes and
the data set is said to be bimodal or multimodal.
In the rare instance of a uniform or nearly uniform distribution,
one where each element is repeated the same
or nearly the same number of times, one could
term it multimodal, but some authors invoke subjectivity by
specifying multimodality only when separate, distinct, and fairly
high peaks (ignoring fluctuations due to randomness) occur.

For binned data, such as occurs with a frequency table,
the interval which contains the most items is the
modal interval and the midpoint of this interval
is considered the mode.
The mode is rather unsophisticated, tends to provide little
information, and does not readily lend itself to mathematical
manipulation. It thus has limited value except when there are
a large number of scores and it can help describe the
distribution or when used for nominal variables.

The Median is the middle element
when the data set is arranged in order of magnitude.

A useful mnemonic is to remember that the median is the grassy strip
(in the rural area of the midwest where I come from)
that divides opposing lanes in a highway. It is in the middle.

If there is an odd number of data elements, the median is a member
of the data set. If there is an even number of data elements, the median
is computed as the arithmetic mean of the middle two.

The median has other names, such as P50,
which will be discussed below.
The Hinkle textbook uses the symbol Mdn for median.

The Midrange
is the arithmetic mean of the highest and lowest data elements.

Midrange is a type of average.
Range is a measure of dispersion and
will be discussed below.
A common mistake is to confuse the two.
Symbolically, midrange is computed as $(x_{max} + x_{min})/2$.

The mode, if it exists, and possibly the median are elements of the
data set. As such, they should be specified no more accurately than
the original data set elements.

The midrange and possibly the median are the arithmetic mean of two
data set elements. One additional significant digit may be necessary
to accurately convey this information.

The number of significant digits for the mean should conform to
one of the following rules.

The significant digits should be no more than the number of
significant digits in the sum of the data elements. Since the
sample size (n) is an exact value, it has no effect on
the number of significant digits obtained from the division.
This is sometimes simplified as a rule of thumb by stating that the mean
should be given to one more decimal place than the original data.
However, this assumes the data set is small (n < 100)
and that the data was recorded to a consistent precision.

The number of significant digits should be consistent
with the precision obtained for the standard deviation.

It is not uncommon in science for
results to be reported to, and interim calculations sometimes rounded to,
three significant digits, which is about all you could get out of
a slide rule. Hence, this was commonly termed slide rule accuracy.
In pre-calculator days, this also made hand calculations easier.

The important thing to remember is not to write down twelve decimal places
without good reason, even though your calculator will often display such.

Presenting more than five significant digits is probably a joke and
points will be deducted!

Consider the question: what is the average of the data set 1, 1, 2, 4, 7?
As we have seen in this lecture, this is a rather ambiguous question and
the answers 1 (mode), 2 (median), 3.0 (mean), and 4.0 (midrange) are
all possible and correct!

A sample of size 5 (n=5) is taken of student quiz scores with the
following results: 1, 7, 8, 9, 10.

The mean is (1+7+8+9+10)/5 = 35/5 = 7.0
(note one more decimal place is given).

All scores occur only once, hence there is no mode.
The median score is 8 (not 8.0).
The midrange is (10+1)/2 = 5.5 (note the extra decimal place is required).
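
As a rough check on these answers, the four averages can be computed with Python's standard library. The helpers modes and midrange are hypothetical names written for this sketch; returning an empty list when nothing repeats is one way to represent "no mode".

```python
from collections import Counter
from statistics import mean, median

def modes(data):
    """Return the most frequent element(s); an empty list if nothing repeats."""
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:
        return []
    return sorted(x for x, count in counts.items() if count == highest)

def midrange(data):
    """Arithmetic mean of the highest and lowest data elements."""
    return (max(data) + min(data)) / 2

quiz = [1, 7, 8, 9, 10]
print(mean(quiz), median(quiz), modes(quiz), midrange(quiz))
# 7 8 [] 5.5

earlier = [1, 1, 2, 4, 7]
print(mean(earlier), median(earlier), modes(earlier), midrange(earlier))
# 3 2 [1] 4.0
```

Note that the code does not apply the significant digit rules; reporting 7.0 rather than 7 remains up to you.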

An extreme score (1) distorts the mean, so perhaps the median is a better
measure of central tendency. For a larger data set, this could be further
described in terms of skewness (median and generally mean to the
left of the mode (negatively skewed), right of the mode (positively skewed),
or the same as the mode (zero skewness)) and symmetry of the data set.
Positive skew is more common, since many variables have a lower limit
but no upper limit, which makes exceptionally large values easier to obtain.

Another important characteristic of a data set
is how it is distributed, or how far each element
is from some measure of central tendency (average).
There are several ways to measure the variability of the data.
Although the most common and most important is the standard deviation,
which provides an average distance for each element from the mean,
several others are also important, and are hence discussed here.

Symbolically, range is computed as $x_{max} - x_{min}$.
Although this is very similar to the formula for midrange, please do not
make the common mistake of reversing the two.
The range is not a reliable measure of dispersion,
since it only uses two values from the data set.
Thus, extreme values can distort the range to be very large
while most of the elements may actually be very close together.
For example, the range for the data set 1, 1, 2, 4, 7 introduced earlier
would be 7-1=6.

Recently it has come to my attention that a few books
define statistical range the same as its more mathematical usage.
I've seen this both in grade school and college textbooks.
Thus instead of being a single number it is the interval over which
the data occurs. Such books would state the range as
[$x_{min}$, $x_{max}$] or
$x_{min}$ to $x_{max}$.
Thus for the example above, the range would be
from 1 to 7 or [1,7]. Be sure you do not say 1-7 since this could
be interpreted as -6.

Hinkle defines range as (Highest score - Lowest score) + 1,
where the +1 ensures that both extreme values are included.
Although he notes the definition given above, he states
that this +1 definition is used throughout the book.
The appropriateness of this modification increases
as the level of measurement decreases.

The Standard deviation is another way to calculate dispersion.
This is the most common and useful measure because it is the
average distance of each score from
the mean. The formula for sample standard deviation is as follows.

Sample standard deviation: $s = \sqrt{\frac{\sum (\bar{x} - x_i)^2}{n - 1}}$

Notice the difference between the sample and population standard deviations.
The sample standard deviation uses n-1 in the
denominator, hence is slightly larger than the
population standard deviation, which uses N (which
is often written as n).

Population standard deviation: $\sigma = \sqrt{\frac{\sum (\mu - x_i)^2}{N}}$

It is much easier to remember and apply these formulae,
if you understand what all the parts are for.
We have already discussed the use of Roman vs. Greek letters
for sample statistics vs. population parameters.
This is why s is used for the sample standard deviation
and $\sigma$ (sigma) is used for the population standard deviation.
However, another sigma, the capital one ($\Sigma$),
appears inside the formula. It serves to indicate that we are
adding things up.
What is added up are the deviations from the mean:
$\bar{x} - x_i$.
But the average deviation from the mean is actually
zero, by definition of the mean!
Occasionally the mean deviation, using average distance
or, using the symbols for absolute value,
$|\bar{x} - x_i|$,
is used.
However, a better measure of variation comes from squaring each deviation,
summing those squares, then taking the square root after dividing by
one less than the number of data elements.
This is very similar to a
quadratic mean.
The n-1 can be understood in terms of
degrees of freedom, a
topic we will have to cover for inferential statistics.
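
The recipe just described (square each deviation, sum, divide, take the square root) translates directly into code. This is a minimal sketch; the function names are illustrative and the data set 1, 1, 2, 4, 7 is the one used for the range example.

```python
from math import sqrt

def sample_sd(data):
    """Standard deviation with n - 1 in the denominator (sample statistic s)."""
    n = len(data)
    x_bar = sum(data) / n
    return sqrt(sum((x_bar - x) ** 2 for x in data) / (n - 1))

def population_sd(data):
    """Standard deviation with N in the denominator (population parameter sigma)."""
    N = len(data)
    mu = sum(data) / N
    return sqrt(sum((mu - x) ** 2 for x in data) / N)

scores = [1, 1, 2, 4, 7]
print(round(sample_sd(scores), 2))      # 2.55
print(round(population_sd(scores), 2))  # 2.28
```

As expected, dividing by n-1 gives the slightly larger value.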

Another formula for standard deviation is also commonly encountered.
It is as follows.

Shortcut formula for standard deviation: $s = \sqrt{\frac{n \sum x_i^2 - (\sum x_i)^2}{n(n-1)}}$

This formula can be algebraically derived from the former
and has two primary applications.
First, calculators and computer programs often employ it because
fewer intermediate results are necessary and it can be calculated in one
pass through the data set.
That is, you don't have to calculate the mean first and then
find the deviations.
Second, it is closely related to a formula which may be used to
calculate the standard deviation for a frequency table.
In general, the formulae are not used and we rely
instead on calculators or computers.
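
To illustrate the one-pass idea, here is a sketch that accumulates only the count, the sum, and the sum of squares, and then applies the commonly quoted shortcut form s = sqrt((n Σx² - (Σx)²) / (n(n-1))). The function name shortcut_sd is hypothetical, and the exact shortcut form is an assumption about which variant the lesson has in mind.

```python
from math import sqrt

def shortcut_sd(data):
    """Sample standard deviation from running totals gathered in a single pass."""
    n = 0
    total = 0.0
    total_of_squares = 0.0
    for x in data:               # the mean is never computed beforehand
        n += 1
        total += x
        total_of_squares += x * x
    return sqrt((n * total_of_squares - total ** 2) / (n * (n - 1)))

print(round(shortcut_sd([1, 1, 2, 4, 7]), 2))  # 2.55
```

One caution: when the mean is very large compared with the spread, subtracting two nearly equal totals can lose precision in floating point arithmetic, so the two-pass definition is the safer choice for such data.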

Variance is the third method of measuring dispersion.
Compare the two variance formulae with their corresponding standard deviation
formulae, and we see that variance is just the square of the standard deviation.
Statisticians tend to consider variance a primary measure and use it
extensively (ANOVA, etc.),
whereas scientists are very happy to use standard deviation exclusively.
Personally, I have difficulty conceptualizing
square points or square dollars.

Occasionally, the abbreviations SD for standard deviation
and Var for variance will be seen.

It can take some time to start to understand how these
measures of variation may be useful.
Consider the following scenarios. First, if a straight five points are
added to everyone's score,
the mean would increase five points, say from 70.8 to 75.8,
but the standard deviation would not be affected.
It remains, say, at 10.9.
Second, if each test score were multiplied by 0.89
and then 21 points were added,
not only would this move the mean from, say, 55.4 to 70.3,
but it would also reduce the standard deviation from, say, 15.0 to 13.4.
This can be useful if the original test scores were very variable,
and could easily have resulted in more D's and F's than
your efforts justified. You might consider a third common way to adjust
test scores, that of lowering the total points possible. Technically this doesn't
change either the mean or the standard deviation of the raw scores, but it does effectively
raise everyone's percentage. This doesn't help the lower scoring
students nearly as much as it helps the top students.
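
The first two adjustments can be tried on any set of scores. In the sketch below the raw scores are made up purely for illustration; only the adjustments (add 5, or multiply by 0.89 and add 21) come from the discussion above.

```python
from statistics import mean, stdev

raw = [38, 47, 55, 62, 75]                    # hypothetical test scores

shifted = [x + 5 for x in raw]                # add a straight five points
rescaled = [0.89 * x + 21 for x in raw]       # multiply by 0.89, then add 21

# Adding a constant moves the mean but leaves the standard deviation alone.
print(mean(shifted) - mean(raw), stdev(shifted) - stdev(raw))

# Multiplying by a constant scales the standard deviation by that constant;
# the added 21 points affect only the mean.
print(stdev(rescaled) / stdev(raw))
```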

A commonly given rule of thumb is that the range of a data set
is approximately 4 standard deviations (4s). Thus the maximum
data element will be about 2 standard deviations above the mean
and the minimum data element about 2 standard deviations below the mean.

The standard deviation of a data set is often used in science as
a measure of the precision to which an experiment has been done.
It can also indicate the reproducibility of the result.
Propagation of error dictates that
intermediate values in your calculations should not be rounded.
At least twice as many digits as will be used in the final answer
should be retained.

It is rather meaningless to calculate the standard deviation for a
data set of two elements.

Three is considered the smallest sample size
where standard deviation is meaningful.

It is not uncommon for an experiment to involve millions of events
and associated data.
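If you examine the standard deviation formula above,
you will note the n-1 in the denominator. More to the point, the standard deviation
of the mean (s divided by the square root of n) depends inversely on the square root of n.
With a million events, we could thus expect to reduce the uncertainty of our answer by
perhaps a thousandfold.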
It is the goal of many experiments to obtain very precise values, so
great care is exercised to reduce systematic errors and also reduce
the effect of random errors by increasing the repetitions.

Example:
Consider a simple example of counting pennies where the outcomes
99, 100 and 101 are obtained. Find the mean and standard deviation.
Solution: We can easily calculate the mean as 100
and the standard deviation as 1.0.

Example:
Consider further if this exercise were repeated 1000 times and 100
was obtained 991 times, 99 five times, and 101 four times.
Again, calculate the mean and standard deviation.
Solution:
The mean is now 99.999 and the standard deviation is now 0.095.
Here the additional precision is justified and the mean and
standard deviation are given to the same 3 decimal place precision.
It would be a mistake to report these results to only one more digit
than the original data set, as in 100.0 and 0.1.
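
Both penny-counting results can be reproduced with Python's statistics module, where stdev is the sample (n-1) standard deviation. The variable names are only illustrative.

```python
from statistics import mean, stdev

three_counts = [99, 100, 101]
print(mean(three_counts), stdev(three_counts))   # 100 1.0

# 1000 repetitions: 100 obtained 991 times, 99 five times, 101 four times
thousand_counts = [100] * 991 + [99] * 5 + [101] * 4
print(round(mean(thousand_counts), 3))    # 99.999
print(round(stdev(thousand_counts), 3))   # 0.095
```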

DO NOT USE a rounded s to obtain $s^2$.
Variance is the primary statistic; s is a derived quantity.

Standard deviation should be reported to at least one more
decimal place than the data, or three significant digits.

We often find it useful to calculate how far,
in standard deviations, a data element was from the mean.
This is a very widely used procedure and this measure has
the name z-score.
It is also termed a standard score.
Since many data sets have
a somewhat normal distribution, it is a very helpful way
to compare data elements from different populations, populations
which may very well have differing means and standard deviations.
However, we will be discussing the normal distribution tomorrow.

A typical example might be
ACT and SAT scores.
ACT scores range from 1 to 36 with a national mean of about 21.0
and standard deviation of about 4.7.
SAT scores range from 200 to 800 (for each subtest)
with a national mean of about 508 and standard deviation of about 111.
Both ACTs and SATs appear to be approximately normally distributed.
High school students often take both, perhaps several times
and those from a particular school would represent a sample.
This sample would have its own mean and standard deviation,
but of course, these would be statistics, not parameters.
(Our Math and Science Center students average about 1050 (total) when they take
the SAT their eighth grade year and average over 1300 (total) when they take
it their junior year. Our average ACT score (junior) is about 29.)
The formulae used for z-score appear in two virtually identical
forms, recognizing the fact that we may be dealing with sample statistics
or population parameters. These formulae are as follows.

z-score formulae: $z = \frac{x - \bar{x}}{s}$ (sample) and $z = \frac{x - \mu}{\sigma}$ (population)

The following important attributes should be noted about z-scores.

Negative z-scores indicate a data element's position below the mean.

Positive z-scores indicate a data element's position above the mean.

z-scores should always be rounded to two decimal places.

IQs of 0 and 210 will be discussed in
lesson 4 and
z-scores of -6.67 and 7.33 should be obtained respectively,
based on a population mean of 100 and a standard deviation of 15.

The population does not have to be normally distributed to calculate
z-scores, but that is one of its primary applications.

In summary, z-scores provide a useful measurement for comparing
data elements from different data sets.
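
A short sketch of the calculation follows. The function name z_score is illustrative; the IQ figures and the national ACT and SAT parameters are those quoted above, while the SAT subtest score of 650 is a made-up value for comparison.

```python
def z_score(x, mean, sd):
    """Number of standard deviations x lies from the mean, to two decimal places."""
    return round((x - mean) / sd, 2)

# IQs of 0 and 210 with population mean 100 and standard deviation 15
print(z_score(0, 100, 15))     # -6.67
print(z_score(210, 100, 15))   # 7.33

# Comparing scores from two different tests
act_z = z_score(29, 21.0, 4.7)    # ACT score of 29
sat_z = z_score(650, 508, 111)    # hypothetical SAT subtest score of 650
print(act_z, sat_z)               # 1.7 1.28
```

On these numbers the ACT score of 29 sits further above its mean than the SAT subtest score of 650 does, even though the raw scores cannot be compared directly.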

Now that we have defined z-score, we can define two more terms as follows.

Data elements more than 2 standard deviations away from the
mean are termed unusual.

Data elements less than 2 standard deviations away
from the mean are termed ordinary.

As you will recall, in a normally distributed population, 95% of the data
will then be ordinary, so only 5% can be unusual. Chebyshev's theorem guarantees
at least 75% of the data to be ordinary, so no more than 25% can be unusual.
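
The two definitions and the Chebyshev figure can be sketched as follows. The names classify and chebyshev_minimum are hypothetical, and treating a score exactly 2 standard deviations from the mean as ordinary is an assumption, since the definitions above leave that boundary case open.

```python
def classify(x, mean, sd):
    """Label a data element by how many standard deviations it is from the mean."""
    z = (x - mean) / sd
    return "unusual" if abs(z) > 2 else "ordinary"

def chebyshev_minimum(k):
    """Least fraction of any data set lying within k standard deviations of the mean."""
    return 1 - 1 / k ** 2

print(classify(135, 100, 15))   # unusual  (z is about 2.33)
print(classify(110, 100, 15))   # ordinary (z is about 0.67)
print(chebyshev_minimum(2))     # 0.75
```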

Yet another method of measuring how a data set is distributed is to
extend the concept of median and use smaller and smaller divisions.
The first division we will examine is the quartile.

Note first how the median divides a population into two halves:
a top half and a bottom half.

The top half consists of those data elements above the median,
whereas the bottom half consists of those data elements below the median.
If we subdivide each of these halves yet again, we have quartered
the population, and these division points are termed quartiles.
Although one might occasionally speak of the bottom quartile,
top quartile, etc., the term quartile technically
refers to the three division points and not to the four divisions of the data.

Q1 is the term used for the median of the bottom half.

Q3 is the term used for the median of the top half.

Q2 is another term used for the median.

The precise definition specifies that at least 25% of the data will be
less than or equal to Q1
and at least 75% of the data will be
less than or equal to Q3.
The terms upper (right) and lower (left) hinge are noted below
and some software packages may not clearly differentiate
between hinges and quartiles.
All these measures of position assume the data
is quantitative and can be put in numeric order.

Data are ranked when arranged in [numeric] order.

Since range is sensitive to outliers (defined below),
sometimes the interquartile range is calculated.
This range is the difference between the third and first quartiles:
Q3-Q1. It is another measure of dispersion.
Other common terms include: semi-interquartile range,
(Q3-Q1)/2,
another measure of dispersion,
and midquartile or
(Q1+Q3)/2,
which is a measure of central tendency (an average).

Another common term is hinge.
There is a left or lower hinge and a right or upper hinge.

The upper hinge is the median of the upper half of all scores,
including the median.

The lower hinge is the median of the lower half of all scores,
including the median.

Outliers are extreme values in a data set.
Sometimes the term outlier is applied to unusual values as defined above
(Triola, 5th edition).
More recently, outliers are defined in terms of the hinges or
quartiles. Outliers are often differentiated as mild
or extreme as defined below.
The interquartile range or perhaps
D = upper hinge - lower hinge is used.
Generally, an outlier should be obvious and not borderline, that is, not right next to
another element but lying just outside some arbitrary line of demarcation.

Consider as an example the data set: {0, 2, 4, 5, 6, 3, 6, 1, 1, 50}.
Obviously, 50 is a much larger number than any of the other elements.
This outlier will cause the mean and variance to be much higher.
Specifically, without 50, the mean is 3.1 and standard deviation 2.3,
whereas with 50, the mean is 7.8 and standard deviation 15.0.
Note that the quartiles are 1 and 6, whereas the hinges are 1.5 and 5.5
for the unmodified data set.
For any of these definitions, 50 is far from the other data and is an outlier.
Outliers might be legitimate data values or errors.
This 50 might really have been 5.0 and was miscoded
(historically, punch card input was column sensitive) or poorly
recorded in a lab book, with the decimal point extremely light or missing.
50 may also represent extreme extra credit on a 5 point quiz!
It is not unusual to be tempted to omit such data values.
It is not considered a good practice, but if such are omitted,
be sure to clearly record that fact.
You will have just crossed the line between objective and subjective science.
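
The figures quoted for this data set can be reproduced with a short sketch. The functions quartiles and hinges are hypothetical helpers that follow one reading of the definitions given earlier (halves excluding the median for quartiles, halves including the median value for hinges); software packages may use other conventions and give slightly different numbers.

```python
from statistics import mean, median, stdev

def quartiles(data):
    """Q1, Q2, Q3 as medians of the bottom half, the whole set, and the top half."""
    s = sorted(data)
    n = len(s)
    return median(s[: n // 2]), median(s), median(s[(n + 1) // 2 :])

def hinges(data):
    """Lower and upper hinge: medians of the halves with the median included."""
    s = sorted(data)
    n = len(s)
    m = median(s)
    if n % 2 == 1:                        # the median is a data element
        lower, upper = s[: n // 2 + 1], s[n // 2 :]
    else:                                 # the median is a computed value
        lower, upper = s[: n // 2] + [m], [m] + s[n // 2 :]
    return median(lower), median(upper)

data = [0, 2, 4, 5, 6, 3, 6, 1, 1, 50]
without_outlier = [x for x in data if x != 50]

print(quartiles(data))   # (1, 3.5, 6)
print(hinges(data))      # (1.5, 5.5)
print(round(mean(without_outlier), 1), round(stdev(without_outlier), 1))  # 3.1 2.3
print(round(mean(data), 1), round(stdev(data), 1))                        # 7.8 15.0
```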

Although not nearly as common as percentiles, which follow below,
deciles are yet another fractile which serves to partition
data into approximately equal parts. Hence, just as there are three
quartiles which divide a population into four parts,
so too are there nine deciles dividing the population into ten parts.
The deciles are termed D1 through D9.

The term stanine is derived from standard nine and
stanine scores range from 1 to 9 with 5 in the center.
Except for 1 and 9, each stanine includes a band of scores
one half a standard deviation wide.
Thus stanine scores are standard scores with a mean of 5
and a standard deviation of 2.
Test scores are commonly expressed using these
single-digit scores which can help students and parents
visualize where someone falls on the test scale.

Psychologists and counselors frequently provide
norm-referenced interpretations of scores
for personality inventory and achievement tests.
This typically means correlating a given score with a given percentile.

Percentiles are also like quartiles,
but divide the data set into 100 equal parts.
Each group represents 1% of the data set.
There are 99 percentiles termed
P1 through P99.

P50 is yet another term for median.

Other equivalents, such as P25=Q1,
P75=Q3,
P10=D1, etc.,
should also be obvious.
Once again, the term percentile technically refers to the
99 division points, but is not uncommonly used to refer to the 100 divisions.
For large data sets, one can calculate the locator L to
help find a requested percentile. It is computed as follows.

Percentile Locator Formula: $L = \frac{k}{100} \cdot n$

k is the percentile being sought and n, of course,
is the number of elements in our data set.
Usual conventions dictate that once L is obtained,
it must be checked to see if it is a whole number.
If it is a whole number, the value of Pk
is the mean of the Lth data element and the next higher data element.
If it is not a whole number, L must be
rounded up to the next larger whole number.
The value of Pk is then the
Lth data element, counting from the lowest.
There is an essential difference between rounding up and rounding off.
For a value such as 3.2, if we round off we get 3,
whereas if we round up we get 4.
Hinkle gives a different formula which is applicable
when the data is binned.
Since percentiles are ordinal, a limited number of
statistical operations are appropriate for them.
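
The locator procedure can be sketched as below, assuming the locator is L = (k/100)·n with k a whole number from 1 to 99. The function name percentile is illustrative, and the data set is the outlier example from earlier in the lesson. Integer arithmetic is used for the whole-number check to avoid floating point surprises.

```python
def percentile(data, k):
    """Value of P_k using the locator L = (k/100) * n and the round-up convention."""
    s = sorted(data)
    n = len(s)
    if (k * n) % 100 == 0:
        # L is a whole number: average the Lth element and the next higher one.
        L = k * n // 100
        return (s[L - 1] + s[L]) / 2
    # L is not a whole number: round UP, then take the Lth element.
    L = k * n // 100 + 1
    return s[L - 1]

data = [0, 2, 4, 5, 6, 3, 6, 1, 1, 50]
print(percentile(data, 25))   # 1    (L = 2.5, rounded up to 3)
print(percentile(data, 50))   # 3.5  (L = 5, mean of 5th and 6th elements)
print(percentile(data, 75))   # 6    (L = 7.5, rounded up to 8)
```

These agree with the quartiles 1 and 6 and the median 3.5 found for the same data.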

The percentile rank of a score is a point
on the percentile scale that gives the percent
of scores at or below the specified score.
When percentiles and scores are graphed
in a cumulative frequency polygon or ogive,
one can read a score on one axis and find its percentile rank on the
other, or read a percentile rank on one and find the corresponding
percentile (a score) on the other.

Another useful summary for a data set is known as a
5-number summary.
We have already seen the middle three members as the quartiles.
The other two members,
the minimum and maximum, were used earlier to calculate the range.
These should be presented in ascending order.
If the lower and upper hinges are defined differently from the quartiles,
they should be used instead of Q1 and
Q3 in a 5-number summary.
Any statistical calculator or software package should
easily provide you with a 5-number summary.
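
If no such package is at hand, a 5-number summary is simple to sketch. The function below is a hypothetical helper that uses the median-of-halves quartiles described earlier rather than hinges; the data set is the outlier example from this lesson.

```python
from statistics import median

def five_number_summary(data):
    """Minimum, Q1, median, Q3, maximum, in ascending order."""
    s = sorted(data)
    n = len(s)
    q1 = median(s[: n // 2])           # median of the bottom half
    q3 = median(s[(n + 1) // 2 :])     # median of the top half
    return s[0], q1, median(s), q3, s[-1]

data = [0, 2, 4, 5, 6, 3, 6, 1, 1, 50]
minimum, q1, med, q3, maximum = five_number_summary(data)
print(minimum, q1, med, q3, maximum)   # 0 1 3.5 6 50
print(q3 - q1)                         # 5, the interquartile range
print(maximum - minimum)               # 50, the range
```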

A boxplot or box and whiskers plot is a visual representation
of the 5-number summary. The diagram is a quick way to spot skewed data.
Illustrated below is a boxplot from the TI-83+ graphing calculator, along with
the window and other settings for the US Presidential Inauguration data.

The whiskers extend either up to 1.5 interquartile ranges above and below
the quartiles or from the minimum to the maximum values.
The former is termed a modified box plot and will have
outliers individually plotted via a symbol of your choice.
Note that Hinkle presents box plots vertically while
many other authors use a horizontal approach.