Exploratory Data Analysis

Exploratory Data Analysis

“The greatest value of a picture is when it forces us to notice what we never expected to see.” -John W. Tukey

Biases, systematic errors and unexpected variability are common in
data from the life sciences. Failure to discover these problems often
leads to flawed analyses and false discoveries. As an example,
consider that experiments sometimes fail and not all data processing
pipelines, such as the t.test function in R, are designed to detect
these. Yet, these pipelines still give you an answer. Furthermore, it
may be hard or impossible to notice an error was made just from the
reported results.

Graphing data is a powerful approach to detecting these problems. We
refer to this as exploratory data analysis (EDA). Many important
methodological contributions to existing techniques in
data analysis were initiated by discoveries made via EDA.
In addition, EDA can lead to interesting biological discoveries which
would otherwise be missed through simply subjecting the data to a
battery of hypothesis tests.
Through this book, we make use of exploratory plots to motivate the analyses we
choose. Here we present a general introduction to EDA using height
data.

We have already introduced some EDA approaches for univariate data,
namely the histograms and qq-plot. Here we describe the qq-plot in more detail and
some EDA and summary statistics for paired data. We also give a
demonstration of commonly used figures that we recommend against.

Quantile Quantile Plots

To corroborate that a theoretical distribution, for example the normal distribution, is in fact a good
approximation, we can use quantile-quantile plots
(qq-plots). Quantiles are best understood by considering the special
case of percentiles. The p-th percentile of a list of a distribution
is defined as the number q that is bigger than p% of numbers (so the inverse of
the cumulative distribution function we defined earlier). For
example, the median 50-th percentile is the median.
We can compute the percentiles for our list of heights:

library(UsingR)##available from CRANlibrary(rafalib)x<-father.son$fheight

Note how close these values are. Also, note that we can see these
qq-plots with less code (this plot has more points than the one we
constructed manually, and so tail-behavior can be seen more clearly).

qqnorm(x)qqline(x)

However, the qqnorm function plots against a standard normal distribution. This is why the line has slope popsd(x) and intercept mean(x).

In the example above, the points match the line very well. In fact, we can run Monte Carlo simulations to see plots like this for data known to be normally distributed.

n<-1000x<-rnorm(n)qqnorm(x)qqline(x)

We can also get a sense for how non-normally distributed data will
look in a qq-plot. Here we generate data from the t-distribution with
different degrees of freedom. Notice that the smaller the degrees of
freedom, the fatter the tails. We call these “fat tails” because if we
plotted an empirical density or histogram, the density at the extremes would be
higher than the theoretical curve. In the qq-plot, this can be seen in
that the curve is lower than the identity line on the left
side and higher on the right side. This means that there are more
extreme values than predicted by the theoretical density plotted on
the x-axis.

Boxplots

Data is not always normally distributed. Income is a widely cited
example. In these cases, the average and standard deviation are not
necessarily informative since one can’t infer the distribution from
just these two numbers. The properties described above are specific to
the normal. For example, the normal distribution does not seem to be a
good approximation for the direct compensation for 199 United States
CEOs in the year 2000.

In addition to qq-plots, a practical summary of data is to compute 3
percentiles: 25-th, 50-th (the median) and the 75-th. A boxplot shows
these 3 values along with a
range of the points within median 1.5 (75-th percentile -
25th-percentile). Values outside this range are shown as points and
sometimes refereed to as outliers.

boxplot(exec.pay,ylab="10,000s of dollars",ylim=c(0,400))

Here we show just one boxplot.
However, one of the great benefits of boxplots is that we could easily
show many distributions in one plot, by lining them up, side by side. We will see several examples of this throughout the book.

Scatterplots And Correlation

The methods described above relate to univariate variables. In the
biomedical sciences, it is common to be interested in the relationship
between two or more variables. A classic example is the father/son
height data used by
Francis Galton to
understand heredity. If we were to summarize these data, we could use
the two averages and two standard deviations since both distributions are
well approximated by the normal distribution. This summary, however,
fails to describe an important characteristic of the data.

The scatter plot shows a general trend: the taller the father, the
taller the son. A summary of this trend is the correlation coefficient,
which in this cases is 0.5. We will motivate this statistic by trying to
predict the son’s height using the father’s height.

Stratification

Suppose we are asked to guess the height of randomly select sons. The
average height, 68.7 inches, is the value with the highest proportion
(see histogram) and would be our prediction. But what if we are told
that the father is 72 inches tall, do we sill guess 68.7?

The father is taller than average. Specifically, he is 1.75
standard deviations taller than the average father. So should we
predict that the son is also 1.75 standard deviations taller? It turns
out that this would be an overestimate. To see this, we look at all the sons
with fathers who are about 72 inches. We do this by stratifying the
father heights.

groups<-split(y,round(x))boxplot(groups)

print(mean(y[round(x)==72]))

## [1] 70.67719

Stratification followed by boxplots lets us see the distribution of
each group. The average height of sons with fathers that are 72 inches
tall is 70.7 inches. We also see that the medians of the strata appear
to follow a straight line (remember the middle line in the boxplot
shows the median, not the mean). This line is similar to the regression
line, with a slope that is related to the correlation, as we will learn
below.

Bi-variate Normal Distribution

A pair of random variables is considered to be approximated by
bivariate normal when the proportion of values below, for example
and , is approximated by this expression:

This may seem like a rather complicated equation, but the concept behind it is rather intuitive. An alternative definition is the following: fix a value
and look at all the pairs for which . Generally, in
statistics we call this exercise conditioning. We are conditioning
on . If a pair of random variables is approximated by a
bivariate normal distribution, then the distribution of conditioned
on is approximated with a normal distribution, no matter what we choose. Let’s
see if this holds with our height data. We show 4 different strata:

Now we come back to defining correlation. Mathematical statistics tells us that when two variables follow a bivariate normal distribution, then for any given value of , the average of the in pairs for which is:

Note that this is a line with slope . This is referred to as the regression line. If the SDs are the same, then the slope of the regression line is the correlation . Therefore, if we standardize and , the correlation is the slope of the regression line.

Another way to see this is to form a prediction : for every SD away from the mean in , we predict SDs away for :

If there is perfect correlation, we predict the same number of SDs. If there is 0 correlation, then we don’t use at all. For values between 0 and 1, the prediction is somewhere in between. For negative values, we simply predict in the opposite direction.

To confirm that the above approximations hold in this case, let’s compare the mean of each strata to the identity line and the regression line:

x=(x-mean(x))/sd(x)y=(y-mean(y))/sd(y)means=tapply(y,round(x*4)/4,mean)fatherheights=as.numeric(names(means))mypar(1,1)plot(fatherheights,means,ylab="average of strata of son heights",ylim=range(fatherheights))abline(0,cor(x,y))

Variance explained

The standard deviation of the conditional distribution described above is:

This is where statements like explains % of the variation in : the variance of is and, once we condition, it goes down to . It is important to remember that the “variance explained” statement only makes sense when the data is approximated by a bivariate normal distribution.