We first introduce the important concepts of statistical inference. We then give a few descriptive measures of the most important characteristic of a data set, its central tendency, followed by a few descriptive measures of the other important characteristic, its variability. The lesson concludes with a discussion of box plots, which are simple graphs that show the central location, variability, symmetry, and outliers very clearly.

To summarize a data set, we want to report different attributes of the data set. One important attribute is the central tendency of the data; the other is how spread out the data is. Further attributes to report include the shape of the data. In this lesson, we mainly discuss the first two attributes, central tendency and spread.

Measures of Central Tendency

Examination of all members of a population is not typically conducted due to the cost and time required. Instead, we typically examine a random sample, i.e., a representative subset of the population.

Descriptive measures of a population are parameters. Descriptive measures of a sample are statistics. For example, the sample mean is a statistic and the population mean is a parameter. The sample mean is usually denoted by \(\bar{y}\):

\[\bar{y}=\frac{y_1+y_2+\ldots+y_n}{n}=\frac{\sum^n_{i=1} y_i}{n}\]

where n is the sample size and the \(y_i\) are the measurements. Since usually only a random sample is drawn, the population mean is unknown, and the sample mean is used to estimate it.

Consider the data set:

1, 1, 2, 3, 13

mean = 4, median = 2, mode = 1

Mean, median and mode are usually not equal. When the data is symmetric, mean is equal to median.
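The three measures for the small data set above can be checked with Python's standard `statistics` module. This is only an illustrative sketch; the variable name `data` is our own.

```python
from statistics import mean, median, mode

# The five-value data set from the text.
data = [1, 1, 2, 3, 13]

print("mean   =", mean(data))    # sum of the values divided by n
print("median =", median(data))  # middle value of the ordered data
print("mode   =", mode(data))    # most frequent value
```

The single large value 13 pulls the mean above the median, which is why the three measures differ here.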

4. Trimmed Mean

One shortcoming of mean: Means are easily affected by extreme values.

Aptitude test scores of ten children:

95, 78, 69, 91, 82, 76, 76, 86, 88, 80

Mean = (95+78+69+91+82+76+76+86+88+80)/10 = 82.1

If the entry 91 is mistakenly recorded as 9, the mean will be 73.9, very different from 82.1.

On the other hand, let us see the effect of the mistake on the median value:

The original data set in increasing order is:

69, 76, 76, 78, 80, 82, 86, 88, 91, 95

median = 81

The data set (with 91 coded as 9) in increasing order is:

9, 69, 76, 76, 78, 80, 82, 86, 88, 95

median = 79

The medians of the two sets are not that different. The median is not much affected by the extreme value 9.

Measures that are not that affected by extreme values are called resistant.

A variation of the mean is the trimmed mean. A 10% trimmed mean drops the highest 10% and the lowest 10% of the observations and averages the rest.

(69), 76, 76, 78, 80, 82, 86, 88, 91, (95)

10% trimmed mean = 82.13

(9), 69, 76, 76, 78, 80, 82, 86, 88, (95)

10% trimmed mean = 79.38

The 10% trimmed mean of the two sets is not that different. It is not as affected by the extreme value 9 as the mean.
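The comparison above can be reproduced with a short sketch. The helper `trimmed_mean` is our own illustrative function, not a library routine, and it assumes the trimming proportion times n is a whole number, as it is for these ten scores.

```python
from statistics import mean, median

def trimmed_mean(values, proportion):
    """Drop `proportion` of the observations from each end, then average the rest."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)      # number of values dropped from each end
    return mean(ordered[k:len(ordered) - k])

scores = [95, 78, 69, 91, 82, 76, 76, 86, 88, 80]
miscoded = [9 if x == 91 else x for x in scores]   # 91 mistakenly recorded as 9

for label, data in (("original", scores), ("91 recorded as 9", miscoded)):
    print(label)
    print("  mean             =", mean(data))
    print("  median           =", median(data))
    print("  10% trimmed mean =", trimmed_mean(data, 0.10))
```

The mean moves by more than 8 points when the error is introduced, while the median and the 10% trimmed mean move by less than 3.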

After reading lesson 2.1, you should know that there are quite a few options for describing central tendency, for example, the mean, median, mode, and trimmed mean. In future lessons, we talk mainly about the mean. However, we need to be aware of one of its shortcomings: it is easily affected by extreme values. As one remedy, one can use the trimmed mean to estimate the central tendency. However, that is very different from saying that one can trim data. Unless the data points are mistakes, one should not remove them from the data set. One should keep the extreme points and use more resistant measures. For example, one can use the sample median to estimate the population median, or the sample trimmed mean to estimate the population trimmed mean. Again, this is not the same as trimming data from the data set.

Skewness

Skewness is a measure of the degree of asymmetry of the distribution.

1. Symmetric

Mean, median, and mode are all the same here; mound shaped, no skewness (symmetric).

The above distribution is symmetric.

2. Skewed Left

Mean to the left of the median, long tail on the left.

The above distribution is skewed to the left.

3. Skewed Right

Mean to the right of the median, long tail on the right.

The above distribution is skewed to the right.

When one has very skewed data, it is better to use the median as the measure of central tendency, since the median is not much affected by extreme values.
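The relationship between the mean and the median described above suggests a crude check for the direction of skew. The function `skew_direction` below is a hypothetical helper written for illustration. It is only a rule of thumb based on the sign of mean minus median, not a formal skewness statistic.

```python
from statistics import mean, median

def skew_direction(values, tolerance=1e-9):
    """Rough indicator: mean right of median -> 'right', left of median -> 'left'."""
    gap = mean(values) - median(values)
    if gap > tolerance:
        return "right"
    if gap < -tolerance:
        return "left"
    return "none"   # mean and median coincide; no skew indicated by this check

print(skew_direction([1, 1, 2, 3, 13]))     # long tail on the right
print(skew_direction([1, 11, 12, 13, 13]))  # long tail on the left
print(skew_direction([1, 2, 3, 4, 5]))      # symmetric data
```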

Note: The IQR is not affected by extreme values. It is thus a resistant measure of variability.

C. Variance and Standard Deviation

Two vending machines A and B drop candies when a quarter is inserted. The number of pieces of candy one gets is random. The following data are recorded:

Pieces of candy from vending machine A:

1, 2, 3, 3, 5, 4

mean = 3, median = 3, mode = 3

Pieces of candy from vending machine B:

2, 3, 3, 3, 3, 4

mean = 3, median = 3, mode = 3

Dotplots for pieces of candy from vending machine A and vending machine B:

They have the same center, but what about their spreads? One way to compare their spreads is to compute their standard deviations. In the following section, we talk about how to compute the sample variance and sample standard deviation for a data set.

Variance is the average squared distance from the mean.

Population variance is defined as:

\[{\sigma}^2=\sum_{i=1}^N \frac{(y_i-\mu)^2}{N}\]

In this formula μ is the population mean and the summation is over all possible values of the population. N is the population size.

The sample variance, which is computed from the sample and used to estimate \(\sigma^2\), is:

\[s^2=\sum_{i=1}^n \frac{(y_i-\bar{y})^2}{n-1}\]

Why do we divide by n - 1 instead of by n? Since μ is unknown and estimated by \(\bar{y}\), the \(y_i\)'s tend to be closer to \(\bar{y}\) than to μ. To compensate, we divide by a smaller number, n - 1.

\(\sigma=\sqrt{{\sigma}^2}\) has the same unit as the \(y_i\)'s. This is a desirable property since one may think about the spread in terms of the original unit.

σ is estimated by the sample standard deviation s:

\[s=\sqrt{s^2}\]

For the data set A,

\(s=\sqrt{2}=1.414\) pieces of candy.
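As a check, the value under the square root can be worked out from the sample variance formula, using the data for machine A (1, 2, 3, 3, 5, 4), \(\bar{y}=3\), and n = 6:

\[s^2=\frac{(1-3)^2+(2-3)^2+(3-3)^2+(3-3)^2+(5-3)^2+(4-3)^2}{6-1}=\frac{4+1+0+0+4+1}{5}=\frac{10}{5}=2\]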

Calculate the standard deviation for data set B.

Answer: \(s=\sqrt{0.4}=0.6325\) pieces of candy.
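Both standard deviations can be verified numerically. The sketch below writes the sample variance formula out directly in a helper of our own (`sample_variance`) and then compares the result with `statistics.stdev`, which also uses the n - 1 divisor.

```python
from math import sqrt
from statistics import stdev

def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    y_bar = sum(values) / n
    return sum((y - y_bar) ** 2 for y in values) / (n - 1)

machine_a = [1, 2, 3, 3, 5, 4]
machine_b = [2, 3, 3, 3, 3, 4]

for name, data in (("A", machine_a), ("B", machine_b)):
    s2 = sample_variance(data)
    print(f"machine {name}: s^2 = {s2:.4f}, s = {sqrt(s2):.4f}, "
          f"statistics.stdev = {stdev(data):.4f}")
```

Machine A has the larger standard deviation, which matches the wider spread of its observations.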

The standard deviation is very useful. One reason is that it has the same unit as the measurements. Also, the empirical rule, which will be explained in the following section, makes the standard deviation an important yardstick to find out approximately what percentage of the measurements fall within certain intervals.

Empirical Rule

In Lesson 2.2 we describe the empirical rule, which also helps us obtain a rough estimate of the standard deviation.

Empirical Rule: if the set of measurements follows a bell-shaped distribution, then roughly 68% of the measurements fall within one standard deviation of the mean, roughly 95% fall within two standard deviations of the mean, and almost all fall within three standard deviations of the mean.
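One way to see the empirical rule at work is to simulate bell-shaped data and count how many values fall within one, two, and three standard deviations of the sample mean. The sketch below is illustrative only: the normal distribution with mean 100 and standard deviation 15, the sample size, and the seed are all arbitrary assumptions of ours, and `fraction_within` is a hypothetical helper.

```python
import random
from statistics import mean, stdev

def fraction_within(values, k):
    """Fraction of values lying within k sample standard deviations of the sample mean."""
    y_bar = mean(values)
    s = stdev(values)
    inside = [y for y in values if abs(y - y_bar) <= k * s]
    return len(inside) / len(values)

rng = random.Random(0)                                  # fixed seed for repeatability
sample = [rng.gauss(100, 15) for _ in range(10_000)]    # simulated bell-shaped data

for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {fraction_within(sample, k):.3f}")
```

The printed fractions should land close to 0.68, 0.95, and nearly 1, though the exact values depend on the simulated sample.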

Using the empirical rule, if your data is roughly bell shaped, then one way to find a rough estimate for the standard deviation of the data set is to find the range and divide it by four.

We divide by four instead of six because this gives a more conservative estimate.

One important point: whenever we want the actual standard deviation of a data set, we should compute it with the formula.

The following five examples (a-e) show that the empirical rule is not that far off even when the underlying distribution is not bell shaped.


How can one find an approximate value of s without going through the detailed computation? It follows from the empirical rule that approximately 95% of measurements (almost all) lie in \(\bar{y} \pm 2s\).

Range \(\approx 4s\)

Approximate value of \(s\approx \frac{\text{range}}{4}\)

Why don't we say \(\bar{y} \pm 3s\) contains all and divide by 6 to obtain the approximate value of s?

In the case of approximating s, it is better to overestimate than to underestimate. Dividing by 4 gives a value that is larger than dividing by 6.

It is important to remember that one has to use the formula:

\(s=\sqrt{\sum_{i=1}^n \frac{(y_i-\bar{y})^2}{n-1}}\)

to compute the sample standard deviation. The approximation \(s\approx \frac{\text{range}}{4}\) only gives a rough estimate of s.
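To see how rough the approximation can be, the sketch below compares range/4 and range/6 with the sample standard deviation computed from the formula. It reuses the ten aptitude test scores from earlier in the lesson purely as an illustration; ten scores are not necessarily bell shaped, so the gap shown here should not be read as typical.

```python
from statistics import stdev

scores = [95, 78, 69, 91, 82, 76, 76, 86, 88, 80]

data_range = max(scores) - min(scores)
estimate_by_4 = data_range / 4    # rough estimate based on y-bar ± 2s
estimate_by_6 = data_range / 6    # smaller estimate based on y-bar ± 3s
exact_s = stdev(scores)           # sample standard deviation (n - 1 divisor)

print("range     =", data_range)
print("range / 4 =", estimate_by_4)
print("range / 6 =", round(estimate_by_6, 3))
print("exact s   =", round(exact_s, 3))
```

For these scores, range/4 is closer to the exact value than range/6, but neither replaces the formula.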

For example, the actual ages (in years) of 36 millionaires sampled, arranged in increasing order is:

How to Compute a Five Number Summary


We want a graph that is not as detailed as a histogram, but still shows:

1. the skewness of the distribution
2. the central location
3. the variability

The skeletal box plot (box-and-whiskers plot) is such a graph.

We need: min, Q1 (lower quartile), Q2 (median), Q3 (upper quartile), and max. This list is also called the five number summary.

Note: We do not follow our textbook's way to calculate Q1, Q2, and Q3.

The results may sometimes be different from the results in our textbook, but will always be the same as Minitab's result.

Recall that the mean is not a resistant measure of the central location but the median is. Neither the range nor the standard deviation is a resistant measure of the spread, but the IQR is. Thus, in the box plot we use the median and the IQR.

How do we compute quartiles? There are two steps to follow:

1. Find the location of the desired quartile:

If there are n observations, arranged in increasing order, then the first quartile is at position \(\frac{1}{4} (n+1)\), the second quartile is at position \(\frac{2}{4} (n+1)\), and the third quartile is at position \(\frac{3}{4} (n+1)\).

2. Find the value in that position for the ordered data. If the position is not a whole number, interpolate between the two neighboring values, as in the example below.

Once we find the first and the third quartiles, we can compute the interquartile range (IQR) by:

IQR = Q3 - Q1

Roughly speaking, IQR gives the range of the middle 50% of the observations.

The final exam scores of 18 students are (in increasing order):

24, 58, 61, 67, 71, 73, 76, 79, 82, 83, 85, 87, 88, 88, 92, 93, 94, 97

In this example, n = 18.

For Q1, its position is:

\(\frac{18+1}{4}=4.75\)

The actual value of Q1:

Q1 = 67 (4th position) + 0.75 · (71 - 67) = 70

For Q2, its position is:

\(\frac{18+1}{2}=9.5\)

The actual value of Q2:

Q2 = 82 (9th position) + 0.5 · (83 - 82) = 82.5

For Q3, its position is:

\(\frac{3(18+1)}{4}=14.25\)

The actual value of Q3:

Q3 = 88 (14th position) + 0.25 · (92 - 88) = 89

Thus the five number summary is:

min = 24, Q1 = 70, Q2 = 82.5, Q3 = 89, max = 97

Five number summary: min, Q1, Q2, Q3, and max.
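The two-step quartile procedure can be sketched in Python as follows. The functions `quartile` and `five_number_summary` are our own illustrative helpers that follow the (n+1)-position rule with linear interpolation described above. As a cross-check, `statistics.quantiles` with `method="exclusive"` (Python 3.8 or later) uses the same positions.

```python
from statistics import quantiles

def quartile(ordered, q):
    """q-th quartile (q = 1, 2, 3) of data already sorted in increasing order."""
    n = len(ordered)
    position = q * (n + 1) / 4            # 1-based position in the ordered data
    whole = int(position)                 # integer part of the position
    fraction = position - whole           # fractional part used for interpolation
    if whole < 1:                         # position falls before the first value
        return ordered[0]
    if whole >= n:                        # position falls at or after the last value
        return ordered[-1]
    lower, upper = ordered[whole - 1], ordered[whole]
    return lower + fraction * (upper - lower)

def five_number_summary(values):
    ordered = sorted(values)
    return (ordered[0], quartile(ordered, 1), quartile(ordered, 2),
            quartile(ordered, 3), ordered[-1])

exam_scores = [24, 58, 61, 67, 71, 73, 76, 79, 82,
               83, 85, 87, 88, 88, 92, 93, 94, 97]

print(five_number_summary(exam_scores))
print(quantiles(exam_scores, n=4, method="exclusive"))   # cross-check of Q1, Q2, Q3
```

Other software may use different quartile conventions, so results can differ slightly from those computed here.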

Using the five number summary, one can construct a skeletal box plot.

Mark the five number summary above the horizontal axis with vertical lines.

Connect Q1, Q2, Q3 to form a box, then connect the box to min and max with a line to form the whisker.

The skeletal box plot of the final exam score:

Box plots are more detailed than skeletal box plots in that they also show outliers. The following terminology will prepare us to draw the box plot.

Potential outliers are observations that lie outside the lower and upper limits.

Lower limit = Q1 - 1.5 · IQR

Upper limit = Q3 + 1.5 · IQR

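Adjacent values are the most extreme values that are not potential outliers. For the final exam score data:

IQR = 89 - 70 = 19

Lower limit = 70 - 1.5 · 19 = 41.5

Upper limit = 89 + 1.5 · 19 = 117.5

Since 24 lies below the lower limit, it is a potential outlier. The adjacent values are 58 and 97.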
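The limits, potential outliers, and adjacent values for the final exam scores can be computed with a few lines of Python. This is a minimal sketch that takes Q1 = 70 and Q3 = 89 from the five number summary in this lesson rather than recomputing them; the variable names are our own.

```python
exam_scores = [24, 58, 61, 67, 71, 73, 76, 79, 82,
               83, 85, 87, 88, 88, 92, 93, 94, 97]

q1, q3 = 70, 89                      # quartiles from the five number summary
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Potential outliers lie outside the limits.
potential_outliers = [x for x in exam_scores if x < lower_limit or x > upper_limit]

# Adjacent values are the most extreme observations inside the limits.
inside = [x for x in exam_scores if lower_limit <= x <= upper_limit]
adjacent_values = (min(inside), max(inside))

print("IQR =", iqr)
print("limits =", (lower_limit, upper_limit))
print("potential outliers =", potential_outliers)
print("adjacent values =", adjacent_values)
```

In the box plot, the whiskers extend to the adjacent values and each potential outlier is plotted as a separate point.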

Minitab command for a box plot: Graph > Box plot.

Box plot of final exam score:

How to tell the shape of the distribution by the box plot:

Symmetric

Left skewed

Right skewed

Lesson 2 - Homework

Practice Problems:

1. In a packing plant, a machine packs cartons with jars. The times it takes each of two machines, a new one and an old one, to pack 10 cartons are recorded. The results (machine.txt), in seconds, are shown in the following table:

New machine

Old machine

42.1

41.3

42.4

43.2

41.8

42.7

43.8

42.5

43.1

44.0

41.0

41.8

42.8

42.3

42.7

43.6

43.3

43.5

41.7

44.1

a. Compute the mean and standard deviation for the time to pack a carton for each machine.

b. Plot the data for each machine.

c. Describe the data for the two machines.

2. The College of Dentistry at the University of Florida has made a commitment to develop its entire curriculum around the use of self-paced instructional materials such as videotapes, slide tapes, and so on. It is hoped that each student will proceed at a pace commensurate with his or her ability and that the instructional staff will have more free time for personal consultation in student-faculty interaction. One such instructional module was developed and tested on the first 50 students proceeding through the curriculum. The following measurements represent the number of hours it took the students to complete the required modular material: