STATISTICS 926 - HOMEWORK 1
Due Mon 2014/09/22, midnight
DATA VIZ PROJECT: INFERENTIAL DENSITY COMPARISON
----------------------------------------------------------------
Email an R package by the deadline to:
stat926.at.wharton@gmail.com
The subject line should contain the following:
Subject: HW 01, 2014, LastName, FirstName
----------------------------------------------------------------
STATEMENT OF THE PROJECT:
Create an R package that generates plots as shown in the four frames of
http://stat.wharton.upenn.edu/~buja/STAT-926/fig-inferential-density-comparison.pdf
The plots show density comparisons endowed with statistical inference.
As shown they are intended for one quantitative variable and one
two-group factor variable. You should, however, also provide for a
multi-group factor variable with three or more groups (see
instructions at the end).
DETAILS:
Inference is based on permutation testing, calibrated for simultaneity
on a grid consisting of a large but finite number of locations on the
quantitative axis. The black curve below the axis shows the magnitude
of the difference of the densities, i.e., abs(d1.hat-d2.hat). The
gray areas are the "insignificance zones". Wherever the black curve
drops below the zones, there is statistically significant difference
between the densities. The different shades of gray indicate 50%, 10%
and 5% simultaneous (FWER) statistical significance. You should also
provide for a simple spaghetti version of the insignificance zone
whereby null density difference curves are plotted on top of each
other.
How is simultaneity for the density differences on a grid achieved?
The idea is to generate a sensible 1-parameter nested family of
insignificance zones and select from this family those zones that have
an estimated 50%, 90% and 95% simultaneous coverage under the null
hypothesis. The one-parameter family of insignificance zones is given
by a family of functions f(x,t) >= 0, where x is in the grid points
{x1,...,xn} and t is the parameter of the family. The function f(x,t)
must be monotone increasing in t, which grants nestedness of the
insignificance zones:
I(t) = { d=(d1...dn) in R^n : di <= f(xi,t) forall i=1...n }
so that I(t1) is a subset of I(t2) if t1 <= t2.
(Note: 'n' is not the sample size but the number of grid points on
which the densities are evaluated for plotting.)
How do we construct insignificance zones? Here is a slight change
from what was said in class: We are not going to use exact estimated
pointwise quantiles of the permutation distribution; instead, we use
multiples of the pointwise estimated sdev:
f(xi,t) = t * sqrt(E[ (d1.hat(xi) - d2.hat(xi))^2 | H0,... ])
The makes sense because under H0 the population densities are the
same: d1 == d2, hence d1.hat(xi) - d2.hat(xi) are asymptotically
centered under H0. Therefore f(xi,t=1) is an estimate of the sdev of
d1.hat(xi) - d2.hat(xi). If these differences are asymptotically
normal (which they are), then f(xi,t) for t>0 are estimates of
POINTWISE quantiles for 2-sided null coverage Phi(t)-Phi(-t) =
2*Phi(t)-1. Your task will be (1) to estimate f(xi,t=1) from the
permutation distribution and (2) to find the values of t.50, t.90 and
t.95 for which I(t.50) has estimated SIMULTANEOUS null coverage 0.50,
and analogously for I(t.90) and I(t.95).
Operationally, here is what needs to be done:
(1) Simulate N.sim random permutations of the group labels, compute
density estimates using 'density()' for both random groups, making
sure that all calls to this function evaluate the density estimates at
the same grid points. Store the absolute values of the evaluated
density differences in a matrix of size N.sim x N.grid. Estimate
f(xi,t=1) and store it in a vector of size N.grid. Determine the
smallest t.max such that I(t.max) has 100% simultaneous null coverage
under the simulation. Then step down on a rough ladder of multiples
t.max*.99, t.max*.98, ... to find approximate values of t.95, t.90 and
t.50, computing the estimated simultaneous coverage for each of these
multiples. No need to be smarter about the search.
(2) Draw the whole thing as in the four frames of the figure. Provide
parameters in your function so the user can choose between the
depiction of I(t.95), I(t.90) and I(t.50) as gray areas as shown in
the left and bottom frames of the figure, or a crude spagghetti plot
as shown in the top right frame of the figure (this does not require
the ladder search of (1)).
(3) Write two wrapper functions for categorical variables with more
than two values as follows:
(3a) One wrapper should produce a matrix of plots for all pairwise
group comparisons. Since plots (i,j) and (j,i) should show the same
plot, find out how to draw frames out of order so you are not
computing the same thing twice.
(3b) The second wrapper should produce a series of plots for the
comparison of each group against the union of the rest. Try to be
minimally smart about how you arrange the series of plots. You may
provide a parameter for the maximum number of plots in a row and
default it to 5.
Use good aesthetics: All frames should be approximately square or only
slightly flat Margins should not be wasteful. Titles should be
informative.
Use a dataset of your choice for illustration and put it in your
package if it is not generally available in R distributions.
================================================================