2 Displaying Quantitative Variables
QTM1310/ Sharpe
The monthly changes in a company’s stock prices are shown. It is hard to tell very much from this table of values. Just as with categorical data, we wish to display these quantitative data in a picture to clarify what we see.

3 Displaying Quantitative Variables: Histograms
A histogram is similar to a bar chart, with the bin counts used as the heights of the bars. Note: there are no gaps between bars unless there are actual gaps in the data.
For the stock price data, each bin has a width of $5, and we display how many of the price change values fall into each of these bins. For example, there were about 20 monthly price changes between $0 and $5.

4 Displaying Quantitative Variables: Histograms
How do histograms work?
• Decide how wide to make the bins; typically bin widths are multiples of 5 or 10.
• Determine the count for each bin.
• Decide where to place values that land on the endpoint of a bin. For example, does a value of $5 go into the $0 to $5 bin or the $5 to $10 bin? The standard rule is to place such values in the higher bin.
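The binning steps above can be sketched in a few lines of Python. This is a minimal illustration, not the software used for the slides: the function name `bin_counts` and the price changes are hypothetical. Flooring each value to the start of its bin places endpoint values (such as exactly $5) in the higher bin, as the standard rule requires.

```python
import math
from collections import Counter

def bin_counts(values, width):
    """Count values per bin [start, start + width); endpoints go to the higher bin."""
    counts = Counter(math.floor(v / width) * width for v in values)
    return dict(sorted(counts.items()))

# Hypothetical monthly price changes in dollars (not the data from the slides).
changes = [-7.5, -2.0, 0.0, 1.2, 3.4, 4.9, 5.0, 5.0, 8.1, 12.3]
counts = bin_counts(changes, 5)

for start, count in counts.items():
    print(f"{start:>4} to {start + 5:<4} {count}")
```

Both values of exactly 5.0 are counted in the $5 to $10 bin. Bins with no values are simply absent from the result; a plotting routine would draw them as gaps.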

5 Displaying Quantitative Variables: Histograms
We may also choose to create a relative frequency histogram by displaying the percentage of cases in each bin instead of the count.
Note: The shape is exactly the same; only the labels have changed.
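As a small sketch of the conversion from counts to relative frequencies, the snippet below divides each bin count by the total. The bin counts and the name `relative_frequencies` are hypothetical, chosen only for illustration.

```python
def relative_frequencies(counts):
    """Convert bin counts to percentages of all cases."""
    total = sum(counts.values())
    return {start: 100 * count / total for start, count in counts.items()}

# Hypothetical bin counts keyed by the lower edge of each $5-wide bin.
counts = {-10: 1, -5: 1, 0: 4, 5: 3, 10: 1}
percents = relative_frequencies(counts)

for start, pct in percents.items():
    print(f"{start:>4} to {start + 5:<4} {pct:.0f}%")
```

Every count is divided by the same total, which is why the bars keep their relative heights and only the axis labels change.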

6 Displaying Quantitative Variables: Stem-and-Leaf Displays
Stem-and-leaf displays are like histograms, but they also give the individual values.
A stem-and-leaf display for the first three years of the monthly stock price data presented earlier is shown below together with a histogram.

7 Displaying Quantitative Variables: Stem-and-Leaf Displays
How do stem-and-leaf displays work?
• Use the first digit of a number (called the stem) to name the bins. The stem is to the left of the solid line.
• Use the next digit of the number (called the leaf) to make the “bars”. The leaf is to the right of the solid line.
• For example, for the number 21, we would write 2 | 1, with 2 serving as the stem, 1 as the leaf, and a solid line in between.

8 Displaying Quantitative Variables: Stem-and-Leaf Displays
Example: Show how to display the data 21, 22, 24, 33, 33, 36, 38, 41 in a stem-and-leaf display.
Note: If you turn your head sideways to look at the display, it resembles the histogram for the same data.
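One way to build the display for this example is sketched below. The helper `stem_and_leaf` is a hypothetical name, and the sketch assumes two-digit whole numbers, so the tens digit is the stem and the ones digit is the leaf.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Return display lines for two-digit whole numbers: tens digit | ones digits."""
    leaves = defaultdict(list)
    for v in sorted(values):
        leaves[v // 10].append(v % 10)
    # Walk every stem from smallest to largest so empty stems still appear.
    return [f"{stem} | {''.join(str(d) for d in leaves[stem])}"
            for stem in range(min(leaves), max(leaves) + 1)]

data = [21, 22, 24, 33, 33, 36, 38, 41]
lines = stem_and_leaf(data)
print("\n".join(lines))
# 2 | 124
# 3 | 3368
# 4 | 1
```

The length of each row of leaves plays the role of a histogram bar, which is why the display resembles the histogram when viewed sideways.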

9 Displaying Quantitative Variables
Before making a histogram or a stem-and-leaf display, the Quantitative Data Condition must be satisfied: the data values are of a quantitative variable whose units are known.
Caution: Categorical data cannot be displayed in a histogram or stem-and-leaf display, and quantitative data cannot be displayed in a bar chart or a pie chart.

10 Shape
When describing a distribution, attention should be paid to
• its shape,
• its center, and
• its spread.
We describe the shape of a distribution in terms of its modes, its symmetry, and whether it has any gaps or outlying values.

11 Shape: Modes
Peaks or humps seen in a histogram are called the modes of a distribution.
A distribution whose histogram has one main peak is called unimodal; one with two peaks is bimodal (see figure); one with three or more is multimodal.

12 Shape: Modes
A distribution whose histogram doesn’t appear to have any mode and in which all the bars are approximately the same height is called uniform.

13 Shape: Symmetry
A distribution is symmetric if the halves on either side of the center look, at least approximately, like mirror images.

14 Shape: Symmetry
The thinner ends of a distribution are called the tails. If one tail stretches out farther than the other, the distribution is said to be skewed to the side of the longer tail. The distribution below is skewed to the right.

15 Shape: Outliers
Always be careful to point out the outliers in a distribution: those values that stand off away from the body of the distribution. Outliers
• can affect every statistical method we will study.
• can be the most informative part of your data.
• may be an error in the data.
• should be discussed in any conclusions drawn about the data.

16 Shape
Characterizing the shape of a distribution is often a judgment call.
Understanding the data and how they arose can help.
An honest desire to understand what is happening in the data should guide your decisions.

17 Center
To find the mean of the variable y, add all the values of the variable and divide that sum by the number of data values, n. The mean is a natural summary for unimodal, symmetric distributions.
We will use the Greek letter sigma (Σ) to represent a sum, so the equation for finding the mean can be written as
ȳ = Σy / n
The mean is considered to be the balancing point of the distribution.

18 Center
If a distribution is skewed, contains gaps, or contains outliers, then it might be better to use the median: the value that splits the histogram into two equal areas.
The median is found by counting in from the ends of the ordered data until we reach the middle value.
The median is said to be resistant because it isn’t affected by unusual observations or by the shape of the distribution.

19 Center
If a distribution is roughly symmetric, we’d expect the mean and median to be close. The histogram below depicts monthly trading volume of AIG shares (in millions of shares) for the period 2002 to 2007. The mean is [...] million shares and the median is [...] million shares.

20 Center
The median is resistant to unusual observations and to the shape of the distribution. Therefore, the median is usually a better choice for skewed data.
The mean is NOT resistant to unusual observations or to the shape of the distribution.
When the distribution is unimodal and symmetric, the mean is a natural summary statistic.
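A short sketch makes the difference in resistance concrete. The two lists are invented for illustration; they differ only in their last value, which stands in for an unusual observation.

```python
from statistics import mean, median

# Hypothetical values; only the last entry differs between the two lists.
typical = [10, 12, 13, 15, 20]
with_outlier = [10, 12, 13, 15, 200]

print(mean(typical), median(typical))            # 14 13
print(mean(with_outlier), median(with_outlier))  # 50 13
```

One extreme value pulls the mean from 14 to 50, while the median stays at 13.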

21 Spread of the Distribution
We need to determine how spread out the data are because the more the data vary, the less a measure of center can tell us.
One simple measure of spread is the range, defined as the difference between the extremes.
Range = max - min

22 Spread of the Distribution
The range is a single value, and it is not resistant to unusual observations. Concentrating on the middle of the data avoids this problem.
The quartiles are the values that frame the middle 50% of the data. One quarter of the data lies below the lower quartile, Q1, and one quarter lies above the upper quartile, Q3.
The interquartile range (IQR) is defined to be the difference between the two quartile values.
IQR = Q3 - Q1
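The range and IQR can be computed as in the sketch below. It assumes the quartiles are taken as the medians of the lower and upper halves of the ordered data, which is only one of several conventions, so statistics software may report slightly different quartiles. The function name `quartiles` and the data are hypothetical.

```python
from statistics import median

def quartiles(values):
    """Q1 and Q3 as medians of the lower and upper halves (one common convention)."""
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])

# Hypothetical data; the second list makes the largest value far more extreme.
data = [2, 4, 4, 5, 7, 8, 10, 30]
extreme = [2, 4, 4, 5, 7, 8, 10, 300]

for values in (data, extreme):
    q1, q3 = quartiles(values)
    print("range =", max(values) - min(values), " IQR =", q3 - q1)
```

Making the largest value ten times bigger changes the range from 28 to 298 but leaves the IQR at 5, which is what resistance means here.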

23 Spread of the Distribution
Taking into account how far each value is from the mean gives a powerful measure of the spread of a distribution.
The (almost) average of the squared deviations of the values of the variable y from the mean is called the variance and is denoted by s². The sum of squared deviations is divided by n - 1 rather than n:
s² = Σ(y - ȳ)² / (n - 1)

24 Spread of the Distribution
The variance plays an important role in measuring spread, but its units are the square of the original units of the data.
Taking the square root of the variance corrects this issue and gives us the standard deviation:
s = √( Σ(y - ȳ)² / (n - 1) )
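The following sketch computes the variance and standard deviation step by step, using the usual sample convention of dividing by n - 1. The data values and the function name `variance` are hypothetical.

```python
from math import sqrt

def variance(values):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    y_bar = sum(values) / n
    return sum((y - y_bar) ** 2 for y in values) / (n - 1)

data = [2, 4, 4, 6, 9]      # hypothetical values; their mean is 5
s2 = variance(data)         # squared deviations 9 + 1 + 1 + 1 + 16 = 28, and 28 / 4 = 7.0
s = sqrt(s2)                # standard deviation, back in the original units

print(s2, round(s, 3))
```

Python’s standard library offers `statistics.variance` and `statistics.stdev`, which follow the same n - 1 convention.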

25 Shape, Center, and Spread: A Summary
Which measures of center and spread should be used for a distribution?
• If the shape is skewed, the median and IQR should be reported.
• If the shape is unimodal and symmetric, the mean and standard deviation, and possibly the median and IQR, should be reported.

26 Shape, Center, and Spread: A Summary
• If there are multiple modes, try to determine if the data can be split into separate groups.
• If there are unusual observations, point them out and report the mean and standard deviation with and without those values.
• Always pair the median with the IQR and the mean with the standard deviation.

27 Five-Number Summary and Boxplots
The five-number summary of a distribution reports its median, quartiles, and extremes (maximum and minimum).
Below is the five-number summary of monthly trading volume of AIG shares (in millions of shares) for the period 2002 to 2007.

28 Five-Number Summary and Boxplots
Once we have a five-number summary of a variable, we can display that information in a boxplot.
A boxplot highlights several features of the distribution of the variable.

29 Five-Number Summary and Boxplots
• The central box shows the middle half of the data, between the quartiles; the height of the box equals the IQR.
• If the median is roughly centered between the quartiles, then the middle half of the data is roughly symmetric. If it is not centered, the distribution is skewed.
• The whiskers show skewness as well if they are not roughly the same length.
• The outliers are displayed individually to keep them out of the way in judging skewness and to display them for special attention.

30 Five-Number Summary and Boxplots
To make a boxplot:
1) Locate the median and quartiles on an axis and draw three short lines. For the AIG data, approximate values are Q1 = 121, median = 136, and Q3 = 182.
2) Then connect the quartile lines to form a box.

31 Five-Number Summary and Boxplots
3) Erect (but don’t show in the final plot) “fences” around the main part of the data, placing the upper fence 1.5 IQRs above the upper quartile and the lower fence 1.5 IQRs below the lower quartile.
4) Draw lines (whiskers) from each end of the box up and down to the most extreme data values found within the fences.
5) Add any outliers by displaying data values that lie beyond the fences with special symbols.
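The fence, whisker, and outlier steps can be sketched as follows. This is an illustration under stated assumptions: the quartiles are taken as medians of the lower and upper halves of the ordered data (software may use other conventions), and `boxplot_parts` and the data values are hypothetical.

```python
from statistics import median

def boxplot_parts(values):
    """Return quartiles, fences, whisker ends, and outliers for a boxplot."""
    ordered = sorted(values)
    half = len(ordered) // 2
    # Assumption: quartiles are the medians of the lower and upper halves.
    q1, q3 = median(ordered[:half]), median(ordered[-half:])
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inside = [v for v in ordered if lower_fence <= v <= upper_fence]
    return {
        "q1": q1, "median": median(ordered), "q3": q3,
        "fences": (lower_fence, upper_fence),
        # Whiskers stop at the most extreme values still inside the fences.
        "whiskers": (min(inside), max(inside)),
        "outliers": [v for v in ordered if v < lower_fence or v > upper_fence],
    }

values = [12, 15, 16, 18, 19, 21, 22, 40]   # hypothetical data
parts = boxplot_parts(values)
print(parts)
```

For these values the fences are at 6.5 and 30.5, so the whiskers run from 12 to 22 and the value 40 is plotted separately as an outlier. The fences themselves are never drawn.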

32 Five-Number Summary and Boxplots
Example: Gretzky
Wayne Gretzky scored 50% more points than anyone else who played professional hockey. Here is the number of games Gretzky played during each of his 20 seasons. Create a stem-and-leaf display.
80, 80, 80, 80, 80, 80, 81, 82, 82, 79, 79, 78, 78, 74, 74, 73, 70, 64, 48, 45

33 Five-Number Summary and Boxplots
Example (continued): Gretzky
Create a stem-and-leaf display of the number of games Gretzky played during each of his 20 seasons.

34 Five-Number Summary and Boxplots
Example (continued): Gretzky
Create a boxplot of the number of games Gretzky played during each of his 20 seasons.
80, 80, 80, 80, 80, 80, 81, 82, 82, 79, 79, 78, 78, 74, 74, 73, 70, 64, 48, 45

35 Five-Number Summary and Boxplots
Example (continued): Gretzky
Create a boxplot of the number of games Gretzky played during each of his 20 seasons.

36 Five-Number Summary and Boxplots
Example (continued): Gretzky
Describe the distribution of the number of games Gretzky played during each of his 20 seasons. What unusual features do you see?
80, 80, 80, 80, 80, 80, 81, 82, 82, 79, 79, 78, 78, 74, 74, 73, 70, 64, 48, 45

37 Five-Number Summary and Boxplots
Example (continued): Gretzky
Describe the distribution of the number of games Gretzky played during each of his 20 seasons. What unusual features do you see?
The distribution of the number of games played per season by Wayne Gretzky is skewed to the left, with 2 outliers (the seasons with 45 and 48 games). He may have been injured during these seasons. The season with 64 games is also separated from the rest by a gap. The median is 79 games, the range is 37 games, and the IQR is 6.5 games.
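The summary values quoted above can be checked by hand. Sorted, the 20 values are 45, 48, 64, 70, 73, 74, 74, 78, 78, 79, 79, 80, 80, 80, 80, 80, 80, 81, 82, 82. The worked arithmetic below assumes the quartiles are the medians of the lower and upper ten values; that convention reproduces the stated IQR of 6.5, but other conventions exist.

```latex
\begin{align*}
\text{median} &= \tfrac{79 + 79}{2} = 79 \\
\text{range} &= 82 - 45 = 37 \\
Q1 &= \tfrac{73 + 74}{2} = 73.5, \qquad Q3 = \tfrac{80 + 80}{2} = 80 \\
\text{IQR} &= Q3 - Q1 = 80 - 73.5 = 6.5 \\
\text{lower fence} &= Q1 - 1.5 \times \text{IQR} = 73.5 - 9.75 = 63.75 \\
\text{upper fence} &= Q3 + 1.5 \times \text{IQR} = 80 + 9.75 = 89.75
\end{align*}
```

Only 45 and 48 fall below the lower fence of 63.75, which gives the two outliers. The season with 64 games lies just inside the fence, so it marks the end of the lower whisker rather than a third outlier.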

38 Comparing Groups
In attempting to understand data, look for patterns, differences, and trends over different time periods.
We can split the data into smaller groups and display histograms for each group. Histograms of the AIG data for single years (2002 and 2003) are shown below.

39 Comparing Groups
Histograms work well for comparing two groups, but boxplots offer better results for side-by-side comparison of several groups.
Below, the AIG data are displayed in yearly boxplots.

42 Comparing Groups
Example (continued): Wine Prices
Write a few sentences describing these prices.
Cayuga Lake and Seneca Lake vineyards have approximately the same average case price of about $200, while a typical Keuka Lake vineyard has a case price of about $260. Keuka Lake vineyards have consistently high case prices, between $240 and $280, with one low outlier at about $170 per case. Cayuga Lake vineyards have case prices from $140 to $270, and Seneca Lake vineyards have highly variable case prices from $100 to $300.

43 Identifying Outliers
What should be done with outliers?
• They should be understood in the context of the data. An outlier for a year of data may not be an outlier for the month in which it occurred, and vice versa.
• They should be investigated to determine if they are in error. The values may have simply been entered incorrectly. If a value can be corrected, it should be.
• They should be investigated to determine why they are so different from the rest of the data. For example, were extra sales or fewer sales seen because of a special event like a holiday?

44 Standardizing
To compare different variables, the values are standardized by measuring how far they are from the mean.
We measure the distance from the mean and divide by the standard deviation; the result is the standardized value. The standardized value tells how many standard deviations each value is above or below the overall mean.

45 Standardizing
Compare two companies (from the “top” 100 companies) with respect to the variables Revenue (in $B) and number of Employees.
US Foodservice had $19.81B in revenue and 26,000 employees. Toys “R” Us had revenue of only $13.72B but 69,000 employees.
For all 100 companies, the mean revenue was $6.23B with a standard deviation of $10.56B; the mean number of employees was 19,629 with a standard deviation of 32,055.

46 Standardizing
Measure how far each value is from the mean by subtracting the mean and then dividing by the standard deviation:
z = (y - ȳ) / s
The resulting value is a standardized value, or z-score. A z-score tells how many standard deviations a value is from the mean.
For example, a z-score of 2.0 indicates that a data value is two standard deviations above the mean.

47 Standardizing
Computing the z-scores for both variables for US Foodservice and Toys “R” Us, we obtain the results summarized below.

                         Revenue ($B)   Number of Employees
Mean (all companies)         6.23             19,629
SD                          10.56             32,055
US Foodservice              19.81             26,000
  z-score                    1.29               0.20
Toys “R” Us                 13.72             69,000
  z-score                    0.71               1.54

Standardizing gives us a way to compare variables even when they’re measured in different units.
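The z-scores for the two companies can be reproduced from the means, standard deviations, and company figures given on the earlier Standardizing slide. The sketch below is only an illustration; the function and variable names are hypothetical.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / sd

# Summary figures for the 100 companies, as quoted in the slides.
revenue_mean, revenue_sd = 6.23, 10.56           # $B
employees_mean, employees_sd = 19_629, 32_055

companies = {
    "US Foodservice": (19.81, 26_000),           # (revenue in $B, employees)
    "Toys R Us": (13.72, 69_000),
}

z = {
    name: (round(z_score(rev, revenue_mean, revenue_sd), 2),
           round(z_score(emp, employees_mean, employees_sd), 2))
    for name, (rev, emp) in companies.items()
}
print(z)
```

Rounded to two decimals, US Foodservice comes out at about 1.29 for revenue and 0.20 for employees, and Toys “R” Us at about 0.71 and 1.54. In standard-deviation terms, Toys “R” Us stands out more for its workforce than US Foodservice does for its revenue.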

48 Standardizing
Example: Customer Ages
As part of a marketing team, you send surveys to customers (using an incentive to guarantee a high response rate) asking for demographic information. The average age of respondents is [...] years, the standard deviation is [...] years, the minimum is 11 years, and the maximum is 48 years. Which has the more extreme z-score, the min or the max?

49 Standardizing
Example (continued): Customer Ages
Which has the more extreme z-score, the min (11 years) or the max (48 years)?
The minimum is farther below the mean than the maximum is above the mean. Because both distances are divided by the same standard deviation, the minimum age has the more extreme z-score.

50 Time Series Plots
A display of values against time is sometimes called a time series plot. Below we have a time series plot of the AIG daily closing prices in 2007.

51 Time Series Plots
Time series plots often show a great deal of point-to-point variation, but general patterns do emerge from the plot.
Time series plots may be drawn with the points connected. Below, the AIG data from before are displayed this way.

52 Time Series Plots
To better understand the trend in time series data, plot a smooth trace. A trace is typically created using a statistics software package and will be discussed in a later section.
The AIG data have been plotted with a smooth trace below.
Unless there is strong evidence for doing otherwise, we should resist the temptation to think that any trend we see will continue indefinitely.
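The slides do not say which smoothing method produced the trace. As one simple stand-in, the sketch below uses a moving average, which replaces each run of consecutive values with its mean. The function name, the window length, and the prices are all hypothetical.

```python
def moving_average(values, window):
    """Simple moving average: one smoothed value for each full window of values."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# Hypothetical closing prices (not the AIG data).
prices = [70, 72, 68, 71, 69, 66, 67, 63, 64, 60]
trace = moving_average(prices, 3)

print([round(t, 1) for t in trace])
```

The smoothed values fall more steadily than the raw prices, which makes the downward trend easier to see. A wider window gives a smoother trace but hides more of the short-term variation.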

53 Time Series Plots
Consider the time series plot for the AIG monthly stock closing price in [...]. The histogram showed a symmetric, possibly unimodal distribution.
The time series plot shows a period of gently falling prices and then the severe decline in September, followed by very low prices.

54 Time Series Plots
When a time series is stationary (without a strong trend or change in variability), a histogram can provide a useful summary.
However, when the time series is not stationary, like the AIG prices after 2007, a histogram is unlikely to display much of interest; a time series plot would be more informative.

55 Transforming Skewed Data
Example: Below we display the skewed distribution of total compensation for the CEOs of the 500 largest companies.
What is the “center” of this distribution? Are there outliers?

56 Transforming Skewed Data
When a distribution is skewed, it can be hard to summarize the data simply with a center and spread, and hard to decide whether the most extreme values are outliers or just part of the stretched-out tail.
One way to make a skewed distribution more symmetric is to re-express, or transform, the data by applying a simple function to all the data values.
If the distribution is skewed to the right, we often transform using logarithms or square roots; if it is skewed to the left, we may square the data values.

57 Transforming Skewed Data
Example: Below we display the transformed distribution of total compensation for the CEOs of the 500 largest companies.
This histogram is much more symmetric, and we see that a typical log compensation is between 6.0 and 7.0, which corresponds to between $1 million and $10 million in the original terms (the logarithms are base 10).
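A brief sketch of the re-expression, using base-10 logarithms so that a log value of 6.0 corresponds to $1 million and 7.0 to $10 million. The compensation figures are invented for illustration and are not the CEO data.

```python
import math

# Hypothetical total compensation values in dollars (right-skewed).
compensation = [1_000_000, 2_500_000, 10_000_000, 40_000_000, 100_000_000]

log_comp = [math.log10(c) for c in compensation]
print([round(x, 2) for x in log_comp])

# Raising 10 to the log value converts back to the original units.
back = [10 ** x for x in log_comp]
```

On the original scale the largest value is 100 times the smallest, while on the log scale the values span only 6 to 8. That compression of the long right tail is what makes the transformed histogram more symmetric.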

58 Don’t make a histogram of a categorical variable. The histogram below of policy numbers is not at all informative.

59 Choose a scale appropriate to the data.
Avoid inconsistent scales. Don’t change scales in the middle of a plot, and compare groups on the same scale.
Label variables and axes clearly.
Do a reality check. Make sure the calculated summaries make sense.
Don’t compute numerical summaries of a categorical variable.

60 Watch out for multiple modes. If the data have multiple modes, consider separating the data.
Beware of outliers.

61 What Have We Learned?
Make and interpret histograms to display the distribution of a variable.
• We understand distributions in terms of their shape, center, and spread.

62 What Have We Learned?
Describe the shape of a distribution.
• A symmetric distribution has roughly the same shape reflected around the center.
• A skewed distribution extends farther on one side than on the other.
• A unimodal distribution has a single major hump or mode; a bimodal distribution has two; multimodal distributions have more.
• Outliers are values that lie far from the rest of the data.

63 What Have We Learned?
Compute the mean and median of a distribution, and know when it is best to use each to summarize the center.
• The mean is the sum of the values divided by the count. It is a suitable summary for unimodal, symmetric distributions.
• The median is the middle value; half the values are above and half are below the median. It is a better summary when the distribution is skewed or has outliers.

64 What Have We Learned?
Compute the standard deviation and interquartile range (IQR), and know when it is best to use each to summarize the spread.
• The standard deviation is roughly the square root of the average squared difference between each data value and the mean. It is the summary of choice for the spread of unimodal, symmetric variables.
• The IQR is the difference between the quartiles. It is often a better summary of spread for skewed distributions or data with outliers.

65 What Have We Learned?
Find a five-number summary and, using it, make a boxplot. Use the boxplot’s outlier nomination rule to identify cases that may deserve special attention.
• A five-number summary consists of the median, the quartiles, and the extremes of the data.
• A boxplot shows the quartiles as the upper and lower ends of a central box, the median as a line across the box, and “whiskers” that extend to the most extreme values that are not nominated as outliers.
• Boxplots display separately any case that is more than 1.5 IQRs beyond each quartile. These cases should be considered as possible outliers.

66 What Have We Learned?
Use boxplots to compare distributions.
• Boxplots facilitate comparisons of several groups. It is easy to compare centers (medians) and spreads (IQRs).
• Because boxplots show possible outliers separately, any outliers don’t affect comparisons.

67 What Have We Learned?
Standardize values and use them for comparisons of otherwise disparate variables.
• We standardize by finding z-scores. To convert a data value to its z-score, subtract the mean and divide by the standard deviation.
• z-scores have no units, so they can be compared to z-scores of other variables.
• The idea of measuring the distance of a value from the mean in terms of standard deviations is a basic concept in Statistics and will return many times later in the course.

68 What Have We Learned?
Make and interpret time plots for time series data.
• Look for the trend and any changes in the spread of the data over time.