Lecture

Descriptive and exploratory statistics
Garib Murshudov
Contents
1. Introduction
2. Location
3. Spread
4. Various plots: plots depend on data
5. Histograms, cumulative distributions and all that
Purpose of descriptive statistics and various plots
• maximize insight into a data set
• uncover underlying structure
• extract important variables
• detect outliers and anomalies
• test underlying assumptions
• In general: build intuition about the data and problem
Descriptive statistics
There are two types of simple numerical descriptors of a data set:
1) Values describing location – mean, median, mode
2) Values describing spread – variance, interquartile range
Descriptive statistics
Additional descriptive numerical values:
Skewness is a measure of the symmetry of the distribution. Positive skewness: the right tail is fatter. Negative skewness: the left tail is fatter.
Kurtosis is a measure of the tails. Positive kurtosis: heavier tails than the normal distribution. Negative kurtosis: lighter tails than the normal distribution.
Skewness and (excess) kurtosis of the normal distribution are zero.
[Figure: Histogram of aa (Skewness = 1.18, Kurtosis = 2.33) and Histogram of bb (Skewness = -1.18, Kurtosis = 2.33); y axis: Frequency]
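As a rough illustration of these two quantities, here is a minimal Python sketch. It assumes the simple moment-based definitions (third and fourth central moments divided by powers of the second), which may differ slightly from the estimator used for the histograms in the lecture. The function names and the toy data are made up for this example.

```python
def central_moment(data, order):
    """Average of (x - mean)**order over the sample."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** order for x in data) / len(data)

def skewness(data):
    """Moment-based skewness: m3 / m2**1.5 (zero for symmetric data)."""
    return central_moment(data, 3) / central_moment(data, 2) ** 1.5

def excess_kurtosis(data):
    """Moment-based kurtosis minus 3, so a normal distribution gives zero."""
    return central_moment(data, 4) / central_moment(data, 2) ** 2 - 3.0

# Hypothetical toy data with a long right tail, and its mirror image.
right_tailed = [1, 1, 1, 2, 2, 3, 4, 8]
left_tailed = [-x for x in right_tailed]

print(skewness(right_tailed), excess_kurtosis(right_tailed))
print(skewness(left_tailed), excess_kurtosis(left_tailed))
```

Mirroring the data flips the sign of the skewness but leaves the kurtosis unchanged, which is the pattern seen for aa and bb.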
Location
The simplest information about a data set is about its location. There are three different location parameters: average, median and mode:
1) Average = sum(data)/Ndata
2) Median: the proportion of data above the median is equal to the proportion below it
3) Mode: the most frequently occurring data point.
[Figure: Histogram of aa; x axis: aa, y axis: Density. Example: mean = 9.91, median = 9.92, mode = 9.98]
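The three location parameters can be sketched with Python's standard `statistics` module. The data values below are made up for illustration. For continuous measurements every value is usually unique, so in practice the mode has to be estimated from a histogram or density rather than by counting repeats as done here.

```python
import statistics

# Hypothetical measurements (not from the lecture).
data = [9.8, 10.1, 9.9, 10.0, 10.1, 10.3, 9.7, 10.1]

average = sum(data) / len(data)      # sum(data)/Ndata
median = statistics.median(data)     # middle of the sorted data
modes = statistics.multimode(data)   # most frequent value(s), as a list

print(average, median, modes)
```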
Location
The average is very sensitive to a few outliers. If we change one value of the data arbitrarily then we can change the average substantially. The median, however, is hardly affected. Example:
13.2 8.2 10.9 14.3 10.7 6.6 9.5 10.8 8.8 13.3 - Av = 10.63, median = 10.75
13.2 8.2 10.9 74.3 10.7 6.6 9.5 10.8 8.8 13.3 - Av = 16.63, median = 10.75
Breakdown point of the average is 0, breakdown point of the median is 0.5, i.e. you have to change 50% of the data dramatically to move the median arbitrarily far.
The median is the most robust estimator
The average is the most convenient estimator, with nice properties
If the sample is small then it may be impossible to estimate the mode.
Breakdown point (Wiktionary): The number or proportion of arbitrarily large or small extreme values that must be introduced into a batch or sample to cause the estimator to yield an arbitrarily large result.
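The example with one corrupted value can be reproduced with a few lines of Python. This is only a sketch using the standard `statistics` module. The variable names are chosen for this illustration.

```python
import statistics

original = [13.2, 8.2, 10.9, 14.3, 10.7, 6.6, 9.5, 10.8, 8.8, 13.3]

# Corrupt a single observation: 14.3 becomes 74.3.
corrupted = list(original)
corrupted[3] = 74.3

for label, values in (("original", original), ("corrupted", corrupted)):
    # The average moves with the outlier, the median stays where it was.
    print(label,
          round(statistics.mean(values), 2),
          round(statistics.median(values), 2))
```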
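Simpson’s paradox: batting averages
One should be careful in dealing with averages. The most famous paradox related to averages is Simpson’s paradox.
                 Runs   Outs   Average
1st Ashes   MW    270      6     45
            SW    500     10     50
2nd Ashes   MW    700     10     70
            SW    320      4     80
Total       MW    970     16     60.63
            SW    820     14     58.57
MW – Mark Waugh
SW – Steve Waugh
SW has the higher average in each series, yet MW has the higher average over both series combined, because the two players' outs are split differently between the series.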
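The arithmetic behind the paradox can be checked directly from the runs and outs in the table. The sketch below is a plain Python illustration. The dictionary layout and helper names are assumptions made for this example.

```python
# (runs, outs) per player and series, copied from the table.
records = {
    "1st Ashes": {"MW": (270, 6), "SW": (500, 10)},
    "2nd Ashes": {"MW": (700, 10), "SW": (320, 4)},
}

def batting_average(runs, outs):
    return runs / outs

# Pool runs and outs over both series for each player.
totals = {}
for players in records.values():
    for player, (runs, outs) in players.items():
        pooled_runs, pooled_outs = totals.get(player, (0, 0))
        totals[player] = (pooled_runs + runs, pooled_outs + outs)

# Who has the better average within each series, and overall?
series_winner = {
    series: max(players, key=lambda name: batting_average(*players[name]))
    for series, players in records.items()
}
overall_winner = max(totals, key=lambda name: batting_average(*totals[name]))

print(series_winner)    # SW wins each series
print(overall_winner)   # MW wins on the pooled data
print({name: batting_average(*t) for name, t in totals.items()})
```

The pooled average is a weighted average of the per-series averages, with weights given by the outs, and these weights differ between the two players.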
Spread
There are two main indicators of the spread of a data set:
1) Standard deviation (= var^(1/2)). It is the usual indicator of spread and is very easy to calculate, but it is not robust to outliers. One outlier is sufficient to corrupt the standard deviation.
2) Interquartile range (IQR): 50% of the data lie between the first and third quartiles. This indicator is more robust. You need to corrupt 25% of the data to corrupt the IQR.
[Figure: Histogram of rn; x axis: rn, y axis: Density. Black vertical lines: quartiles. Blue vertical lines: mean+sd, mean-sd]
Spread: robustness
The standard deviation is very sensitive to a few outliers. If we change one value of the data arbitrarily then we can change the standard deviation substantially. The IQR, however, is hardly affected. Example:
13.2 8.2 10.9 14.3 10.7 6.6 9.5 10.8 8.8 13.3 - sd = 2.45, IQR = 3.65
13.2 8.2 10.9 74.3 10.7 6.6 9.5 10.8 8.8 13.3 - sd = 20.37, IQR = 3.65
Breakdown point of sd is 0, breakdown point of IQR is 0.25, i.e. you have to change at least 25% of the data dramatically to affect the IQR arbitrarily.
IQR is much more robust than sd
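The sd and IQR values in this example can be recomputed with the sketch below. It assumes the sample standard deviation (divisor n-1) and linearly interpolated quartiles (`method="inclusive"` in Python's `statistics.quantiles`), which reproduce the numbers quoted above. Other quartile conventions give slightly different IQR values.

```python
import statistics

def iqr(values):
    """Third quartile minus first quartile, with linear interpolation."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1

original = [13.2, 8.2, 10.9, 14.3, 10.7, 6.6, 9.5, 10.8, 8.8, 13.3]
corrupted = [13.2, 8.2, 10.9, 74.3, 10.7, 6.6, 9.5, 10.8, 8.8, 13.3]

for label, values in (("original", original), ("corrupted", corrupted)):
    # sd is inflated by the single outlier, the IQR does not change.
    print(label, round(statistics.stdev(values), 2), round(iqr(values), 2))
```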
Tukey’s five number summaries
One of the important books on statistical data analysis is:
Tukey, JW. (1977) Exploratory data analysis
After this book there was an explosion of exploratory data analysis, i.e. visualisation of data sets and modelling based on visual analysis.
One of the suggestions in this book is the five number summary of a data set. Essentially these numbers are (although in Tukey’s book slightly different numbers are suggested):
Minimum, 1st quartile, median, 3rd quartile, maximum. These numbers (together with the mean) are calculated by R with the command summary.
For example:
A <- c(13.2, 8.2, 10.9, 14.3, 10.7, 6.6, 9.5, 10.8, 8.8, 13.3)
summary(A)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  6.600   8.975  10.750  10.630  12.620  14.300
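For readers working outside R, a five number summary can be sketched in Python as follows. The function name is made up for this illustration. The quartiles use linear interpolation (`method="inclusive"`), which appears to match the values printed by `summary` for this data set. Tukey's original hinges are defined slightly differently.

```python
import statistics

def five_number_summary(values):
    """Return (minimum, 1st quartile, median, 3rd quartile, maximum)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)

A = [13.2, 8.2, 10.9, 14.3, 10.7, 6.6, 9.5, 10.8, 8.8, 13.3]
print(five_number_summary(A))
```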
Various plots
In general, data visualisation depends on the type of data and the system it comes from.
For some data sets a few general-purpose plots can be suggested. These include:
1) Box and whisker plot – boxplot
2) Histograms
3) Cumulative distribution plots
4) QQ plots
Boxplots
Boxplots are a convenient way to visualise one-dimensional data. A boxplot shows the minimum, maximum, first quartile, median and third quartile: it is a visual representation of the five number summary. This plot can indicate whether the distribution of the data is symmetric. It may also indicate outliers, i.e. points that are too different from the others, e.g. points outside the interval (median ± 2*IQR).
[Figure: example boxplots]
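The outlier rule mentioned above can be sketched as a small Python function. The interval is read here as median ± 2*IQR, which is an assumption about the intended rule. Many boxplot implementations instead flag points more than 1.5*IQR beyond the quartiles. The function name and the parameter `k` are made up for this example.

```python
import statistics

def flag_outliers(values, k=2.0):
    """Return the points lying outside (median - k*IQR, median + k*IQR)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = median - k * spread, median + k * spread
    return [x for x in values if x < low or x > high]

clean = [13.2, 8.2, 10.9, 14.3, 10.7, 6.6, 9.5, 10.8, 8.8, 13.3]
corrupted = [13.2, 8.2, 10.9, 74.3, 10.7, 6.6, 9.5, 10.8, 8.8, 13.3]

print(flag_outliers(clean))      # nothing flagged
print(flag_outliers(corrupted))  # the corrupted value is flagged
```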
Side by side boxplot
Boxplots can be used for visual comparison of data, e.g. effects of different treatments.
[Figure: side by side boxplots for sprays A to F, "Effect of different insecticides"]
Boxplots
Boxplots are just schematic plots. Sometimes they may mask some of the features of the data. A classic example is Lord Rayleigh’s data on measurements of the density of nitrogen derived from different sources, which led to the discovery of argon.
Rayleigh was led into the investigation by small anomalies he found in
measurements of the density of nitrogen purified by different methods. Those
different methods led to different quantities of nitrogen, and thus to different
proportions of nitrogen and a hitherto unsuspected atmospheric gas. Argon was the
first noble gas isolated. Ramsay's subsequent work isolated helium and discovered
neon, krypton, and xenon by the end of the century. Ramsay and Rayleigh were
awarded Nobel Prizes in 1904. Rayleigh was awarded the physics prize for argon,
while Ramsay was awarded the chemistry prize for argon and the family of noble
gases.
Boxplots
By looking at the boxplot we do not see any peculiarity in the data. However, one can notice that the whiskers are very close to the edges of the box, i.e. the minimum and maximum are close to the first and third quartiles respectively. When you see that, you should be suspicious about the data.
If we plot a scatter (dot) plot side by side with the boxplot, we see peculiar behaviour. There seem to be two classes.
Let us use boxplots for the different sources of nitrogen. There are definitely two classes: one derived from air and another from other sources.
[Figure: boxplot of Nitrogen; scatter plot of Nitrogen against Index next to the boxplot; side by side boxplots for Air and NoAir]
Insectsprays revisited
If we do a side by side scatter plot and boxplot of the InsectSprays data, we see that there is some peculiarity for spray F. I do not know the reason, but it may be interesting to investigate if you see something like that in your data.
[Figure: scatter plot of N Insects against Index next to boxplots for sprays A to F]
Histograms
Histograms are a good way of visualising 1D data (there are higher dimensional versions also). If there are enough data points then histograms may indicate the potential distribution, multimodality and skewness.
For visually pleasing histograms the number of bins used to calculate the histogram is important. Too many bins might be very noisy, too few bins can mask important features.
Scott DW, Multivariate Density Estimation
[Figure: histograms of rn1 and rn2 with Nbin=5, Nbin=50 and Nbin=500; y axis: Frequency]
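The effect of the number of bins can be explored without any plotting library. The sketch below counts points in equal-width bins for three bin counts. The simulated sample, the seed and the function name are assumptions made for this illustration and are not the rn1 or rn2 data from the lecture.

```python
import random

def histogram_counts(values, nbins):
    """Count the values falling into nbins equal-width bins over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for y in values:
        # Bin i holds lo + i*width <= y < lo + (i+1)*width;
        # the maximum itself is put into the last bin.
        i = min(int((y - lo) / width), nbins - 1)
        counts[i] += 1
    return counts

rng = random.Random(0)
sample = [rng.gauss(0.0, 1.0) for _ in range(1000)]

for nbins in (5, 50, 500):
    counts = histogram_counts(sample, nbins)
    # Few bins: large counts, little detail. Many bins: small, noisy counts.
    print(nbins, max(counts), counts.count(0))
```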
Cumulative frequency (probability) plot
Histograms represent the density of a probability distribution. To plot a histogram we must divide the range of the data into bins and then count the number of data points in each bin (for bin number i we count the data points obeying x_i ≤ y < x_{i+1}, where x_i is a bin boundary and y is an observation). Each bin may contain a very small number of data points, so the counts may vary a lot, resulting in noisy histograms.
Cumulative frequency (probability) plots are another way of representing data. In this case we count the number of data points below a given point (all y for which y < x_i). The number of data points becomes larger and larger as x_i approaches the maximum value of the data points.
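A cumulative frequency function can be sketched in a few lines of Python. This version follows the strict inequality y < x used in the text. Many implementations use y ≤ x instead, which would correspond to `bisect_right` here. The function name is made up for this example.

```python
from bisect import bisect_left

def ecdf(values):
    """Return a function giving the fraction of observations below x."""
    ordered = sorted(values)
    n = len(ordered)

    def fraction_below(x):
        # bisect_left returns the number of observations strictly below x.
        return bisect_left(ordered, x) / n

    return fraction_below

data = [13.2, 8.2, 10.9, 14.3, 10.7, 6.6, 9.5, 10.8, 8.8, 13.3]
fn = ecdf(data)
print(fn(6.6), fn(10.75), fn(100.0))
```

No binning is needed, so there is no bin count to choose and every data point contributes to the curve.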
Cumulative frequency plots
Cumulative distributions may indicate whether the data points have a normal distribution, a heavy tail or some other peculiarities. These plots can also help to select an appropriate distribution. However, these plots are hard to interpret on their own.
One way of comparing two distributions would be plotting them on the same plot. To do this we need at least to standardise the data. Even after standardisation the range of the data can be very different.
Data standardisation:
y = (x-mean(x))/sd(x)
[Figure: ecdf(rn), ecdf(rn1) and ecdf(rn2); x axis: x, y axis: Fn(x)]
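The standardisation formula translates directly into code. The sketch below assumes the sample standard deviation (divisor n-1), as R's `sd` uses. The function name is made up for this example.

```python
import statistics

def standardise(values):
    """Apply y = (x - mean(x)) / sd(x) to every observation."""
    centre = statistics.mean(values)
    scale = statistics.stdev(values)
    return [(x - centre) / scale for x in values]

data = [13.2, 8.2, 10.9, 14.3, 10.7, 6.6, 9.5, 10.8, 8.8, 13.3]
scaled = standardise(data)

# After standardisation the mean is 0 and the standard deviation is 1,
# but the minimum and maximum still depend on the shape of the distribution.
print(round(min(scaled), 2), round(max(scaled), 2))
```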
QQ plots
Quantile-quantile plots are useful when testing distributional assumptions. These plots can indicate whether two data sets are from the same distribution. If they are, the plots can help to find the linear transformation of one of them into the other.
Mathematically: let us say that X is from the distribution with cumulative distribution function (CDF) F(x) and Y has the CDF G(y). Then by solving
G(y) = F(x) ⇒ y = G^{-1}(F(x))
we can find the relationship between y and x. As can be seen, random variables can be converted from one to another using QQ plots.
For example, if x is from the exponential distribution, F(x) = 1 - exp(-lambda x), and y is from the uniform distribution on the interval (a,b), G(y) = (y-a)/(b-a), then we need to solve:
(y-a)/(b-a) = 1 - exp(-lambda x) ⇒ y = b - (b-a) exp(-lambda x)
If the QQ plot looks like this exponential function then we may have this particular relationship.
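The exponential-to-uniform relationship can be checked numerically. The sketch below draws exponential samples and maps them through y = b - (b-a) exp(-lambda x). The values lambda = 3, a = 0, b = 1, the seed and the function name are assumptions chosen for this illustration.

```python
import math
import random

def exp_to_uniform(x, lam, a, b):
    """y = G^{-1}(F(x)) for F exponential(lam) and G uniform on (a, b)."""
    return b - (b - a) * math.exp(-lam * x)

rng = random.Random(1)
lam, a, b = 3.0, 0.0, 1.0

xs = [rng.expovariate(lam) for _ in range(10000)]
ys = [exp_to_uniform(x, lam, a, b) for x in xs]

# If the mapping is right, ys should look uniform on (a, b):
# values inside the interval and a mean close to (a + b) / 2.
print(min(ys), max(ys), sum(ys) / len(ys))
```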
QQ plots
Example: uniform and exponential distributions
[Figure: two panels. Empirical: ru against r1. Theoretical: 1 - exp(-3 * xx) against xx]
QQ norm
QQnorm is a special case of the QQ plot: it is a quantile-quantile plot against the normal distribution.
QQ norm can already indicate some properties of the data.
1) Outliers:
[Figure: three Normal Q-Q Plots (Sample Quantiles against Theoretical Quantiles): Normal, Too large value, Too small value]
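The points of a normal QQ plot can be computed without a plotting library. The sketch below pairs the sorted data with standard normal quantiles at the plotting positions (i + 0.5)/n, which is one common convention and an assumption here. R's `qqnorm` uses slightly different positions for small samples. The function name is made up for this example.

```python
from statistics import NormalDist

def qqnorm_points(values):
    """Return (theoretical quantile, sample quantile) pairs for a normal QQ plot."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

# Data with one value that is too large.
data = [13.2, 8.2, 10.9, 74.3, 10.7, 6.6, 9.5, 10.8, 8.8, 13.3]
for theoretical, sample in qqnorm_points(data):
    print(round(theoretical, 2), sample)
```

In the printed pairs the last point jumps far above the trend of the others, which is how a too-large value shows up in the plot.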
QQ norm
2) Bimodality, skewness (note that small curvature for small and large values can be expected)
[Figure: three Normal Q-Q Plots (Sample Quantiles against Theoretical Quantiles): Normal, Skewed to left, Bimodal; annotations: Convex, Curvature]
QQ norm
3) Heavier tail
[Figure: Histogram of rn (Normal), Histogram of rtt (t distribution, df = 3) and Normal Q-Q Plot annotated "Heavier tails"]
QQ norm
If the distributions of two random variables have the same form then we may derive the linear transformation of one data set into the other.
Qqplot: one data set vs another. The slope and intercept of the line give the linear transformation needed:
y = a + bx with a = 3, b = 10
[Figure: QQ plot of rn1 against rn]
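The slope and intercept can be estimated by fitting a straight line through the QQ points. The sketch below simulates two normal samples, sorts them to obtain matching quantiles, and fits the line by ordinary least squares. The simulated data, the seed and the function name are assumptions for this illustration. They are not the rn and rn1 samples from the lecture.

```python
import random

def fit_line(xs, ys):
    """Least-squares fit of y = a + b*x. Returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

rng = random.Random(2)
# Sorting both samples pairs up their empirical quantiles.
rn = sorted(rng.gauss(0.0, 1.0) for _ in range(5000))
rn1 = sorted(rng.gauss(3.0, 10.0) for _ in range(5000))

a, b = fit_line(rn, rn1)
print(round(a, 2), round(b, 2))  # should be close to a = 3, b = 10
```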
Conclusions
• Average and variance are the usual measures of location and spread of the data. However, they are not robust. Median and IQR are more robust
• A boxplot is a good way of summarising data, however it might mask features of the data
• QQ plot can be used to check distribution assumptions
References
• Tukey JW (1977) Exploratory Data Analysis
• Scott DW, Multivariate Density Estimation