================================================================
* SIMULTANEOUS INFERENCE IN LINEAR MODELS: MAX-t AND SCHEFFE
(Background: Ioannidis, irreproducible research; Simmons, Nelson, Simonsohn)
on our website:
site Problems of simultaneous/multiple inference
- Common approaches to the problem are based on two fundamental concepts:
. family-wise error rates (FWER)
. false discovery rates (FDR)
We will apply the notion of FWER to linear models.
FWER = simultaneous Type I error probability
FDR is more appropriate in situations of many independent tests;
it is harder to apply in regression where tests are dependent.
(Exception: designed experiments with orthogonal predictors.)
- Meanings of FWER:
. Testing: The probability (rate) of making ANY false rejection
among a number of tests is (no more than) alpha (=FWER).
. Confidence: The probability (rate) of catching ALL true parameter
values among a number of CIs is (at least) 1-alpha (=1-FWER).
- We will consider FWER control based on the following approaches:
. Bonferroni
. exact control under independence
. max-t / min-pval simultaneous inference
. Scheffe simultaneous inference
- Bonferroni: a rough-and-ready rule
. Example: Consider a regression with p predictors.
Assume that we want our inferences to be simultaneouly controlled
at 5\% for all p t-statistics.
. In general:
We would like a significance level alpha (=0.05) SIMULTANEOUSLY
for false rejection in any among q tests.
H0.sim: beta1=beta1.0 & ... & betaq = betaq.0
^^^^^^^^^^^^^ ^^^^^^^^^^^^^^^
H0.1 & ... & H0.q # H = null hypotheses, not hat matrices!
Test statistics: tj = (bj - betaj.0) / SE.hat(bj) (as usual)
FWER = P[ any false rejection j=1,...,q | H0.sim ]
= P[ |t1| > c or ... or |tq| > c | H0.sim ]
<= P[ |t1| > c | H0.sim ] + ... + P[ |tq| > c | H0.sim ]
<= q * max_{j=1..q} P[ |tj| > c | H0.j ] (******************) see below
= q * alpha.ind (alpha.marg = marginal/one-at-a-time/individual Type I Error)
PS: The step to (*****************) requires some thought,
one of the subtleties in this area.
Recall: For any betaj, the other betak are nuisance parameters.
The step to (*****************) should be considered an inequality, not an equality,
and not just for the reason that there is a 'max_{j=1..q}'.
There is also the difference between H0.sim and H0.j !
Both are composite hypotheses, but H0.sim is a subset of H0.j,
because H0.sim specifies the other parameters to have specific values,
whereas H0.j does not.
Therefore, one should read
P[ |tj| > c | H0.j ]
as follows:
P[ |tj| > c | H0.j ] =
max_{betak.0 arbitrary for all k (!= j)}
P[ |tj| > c | betaj=betaj.0, betak=betak.0 for k != j ]
because the test of betaj should be at level alpha for any setting of betak (k != j).
However, one also should read
P[ |tj| > c | H0.sim ]
as follows:
P[ |tj| > c | H0.sim ] =
P[ |tj| > c | betaj=betaj.0, betak=betak.0 for k != j ]
Thus we have
P[ |tj| > c | H0.sim ] <= P[ |tj| > c | H0.j ] (******************)
The pivotal nature of the t-tests reliefs us of this
concern: They are level alpha for all nuisance settings,
hence we have equality in (******************).
[Being careless about the nature of composite null
hypotheses can lead to thinking errors and incorrect
inference methods.]
[Further clarifications:
- A model is set of dataset distributions.
- A parametric model is a set of dataset distributions parametrized by
a finite-dimensional vector such as (beta0,beta1,...,betap,sigma^2).
- Example: normal linear model
M = { P(dy1,dy2,...,dyN): P = N(sum_{j=0...p} xj betaj, sigma^2 I_N) }
- A null hypothesis is a subset of the model M, usually given by a restriction
on the model parameters:
H0.j = { P: betaj=0 }
= { P(dy1,dy2,...,dyN): P = N(sum_{k=0...p,!=j} xk betak, sigma^2 I_N) }
H0.sim = {P: beta1=beta2=...=betap=0 }
= { P(dy1,dy2,...,dyN): P = N(x0 betak0, sigma^2 I_N) } (x0=e=(1,...,1))
= H0.1 intersect H0.2 intersect ... intersect H0.p
- Type I error control really means controlling
max_{P in H0.j} P[|tj| > c] < alpha
- The reason for not needing ' max_{P in H0.j}' in regression t-tests
is that the t-statistic is pivotal:
its null distribution does not depend on the nuisance parameters,
or, equivalently, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
the rejection probabilities P[|tj| > c] are the same for all P in H0.j !!
] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Consequence: If we determine the cut-off 'c' in individual tests such that
---------------------------------
| |
| alpha.ind = alpha / q | Bonferroni rule:
| | universally conservative
---------------------------------
then we control the FWER CONSERVATIVELY at level alpha.
. Bonferroni rule: Test individually/marginally at level alpha/q,
and you will test family-wise at level alpha.
. Q: When is the Bonferroni rule too conservative?
A: ...
Ex.: Drastic, but makes the point...
Assume the q tests are identical.
I.e., when they reject/retain together 100\% of times.
==> They are really just one test masquerading as q tests.
Hence:
P[ false rejection in all tests ]
= P[ false rejection in one test ]
= alpha
No adjustment for simultaneity is needed.
Generalizing: The more the tests are correlated,
the less adjustment for simultaneity is needed.
- Exact control under independence:
. In what follows we consider a situation as in regression when sigma is known
because we want to analyze the case of independent test statistics.
Q: Why can''t the t-statistics in a regression be independent?
A: ...
==> Assuming sigma known means working with z- rather than t-stats.
This is a good approximation to the case when N is VERY large
compared to p.
. Assume the z-statistics z1,...,zq are independent.
In regression this is the case when the predictors X1,...,Xq are orthogonal,
or if we use orthogonalized predictors: X1, X2.1, X3.12, X4.123, ... Xq.1.2...{q-1}
1 - FWER = P[ no false rejection | H0(1) & ... & H0(q) ]
= P[ |z1| <= c and ... and |zq| <= c | H0(1) & ... & H0(q) ]
= Product_{j=1...q} P[ |zj| <= c ]
= Product_{j=1...q} (1-alpha) # If all tests are performed at level alpha.
= (1-alpha)^q
Therefore, if we determine 'C' in individual tests such that
----------------------------------------------
| |
| P[ |z| > C ] = 1 - (1 - alpha)^{1/q} | valid under independence of tests
| |
----------------------------------------------
then we control the FWER EXACTLY at level alpha.
. Justification of Bonferroni as a good rule under independence:
(1 - alpha)^{1/q} <= 1 - alpha/q (1st order Taylor at alpha=0, concavity of x^{1/q})
1 - (1 - alpha)^{1/q} >= 1 - (1 - alpha/q) = alpha/q
==> alpha/q is a conservative per-test level
to achieve simultaneous level alpha.
Note, though: Bonferroni is not exact in this case either;
it is just a good approximation.
- Dependent test: The max-|t| approach
. Applies when the tests are correlated.
Consider q 2-sided t-tests IN A MULTIPLE REGRESSION. The FWER is
P[ any |tj| > c | H0 ] = P[ max_j |tj| > c | H0 ]
Call 'max_j |tj|' the 'max-|t|' statistic.
It is a statistic in its own right.
The problem is we don''t know its null distribution.
However, we can easily approximate it by simulation:
. Simulation recipe for FWER control:
+ Simulate 9,999 values of max-|t|.
Make sure to use the given predictors X to
capture the correlations among the tj''s:
tj = / s
+ Store these 9,999 values, sort them, and pick the
9,500''th value as an estimate of the cut-off 'C' for which
P[ max |tj| > C ] ~=~ 0.05
PS: The simulation cost depends on p only, not on N:
One can pre-project all xj./|xj.| onto the
column space using an o.n. basis thereof, and
thereafter sample rnorm(p+1) instead of rnorm(N).
Also, one draw from 's' can be simulated as
sqrt(rchisq(1,df=N-p-1) / (N-p-1)).
Due to pivotality wrt sigma you can choose sigma=1 wlog.
. Executing max-|t| inference in R: 'Boston Housing data'
# Repeat from Chapter 1: reading the data
site The columns of 'X.adj' contain the unit vectors xj.adj/|xj.adj|
# All of this was preparation for the simulation.
# Here is an example of simulating one set of t-statistics:
t.stats 2.86
# at FWER=0.01 if |tj| > 3.36
# - construct family-wise 95% CIs as [bj +- 2.86*SE[bj]]
# family-wise 99% CIs as [bj +- 3.36*SE[bj]]
# Reverse question: If you used the conventional cut-off,
qt(0.975, df=N-p-1)
## [1] 1.964797
# what would the FWER error rate be?
sum(t.max > qt(0.975, df=N-p-1))/(Nsim+1)
## [1] 0.44328
# ==> For over 44% of datasets there would be a false rejection
# under the simultaneous null hypothesis.
# Here the results for the actual data:
summary(lm(MEDV ~ ., data=boston))
# ==> At FWER=0.05 there is no ambiguity.
# At FWER=0.01 several predictors are insignificant: CRIM, CHAS, TAX
# whereas ZN, B are just barely significant.
- The min-pvalue approach to FWER control:
. Example of a novel question that arises in practice:
Sequential testing when acquiring more and more data.
What problem would we face if we acquired the Boston Housing data one
census tract after another and tested each time?
. Consider the following simplified setup:
We have a single predictor, and inference is sought for its slope.
The data are acquired sequentially in groups of cases, and a t-test is
performed after each time an additional group of cases is acquired.
You can imagine the 'investigator' stops testing once the p-value falls
below 0.05.
. Technical difficulty:
o In the above example we developed simultaneous inference for
all 13 slopes in the Boston Housing data. We did this with a max-|t|
approach. Note all t-stats had the same dfs.
o Max-|t| doesn''t work in sequential testing because the degrees of
freedom keep increasing as we get more data.
This raises a general problem:
Q: How can we make different types of test statistics comparable?
For example, we may want simultaneous inference for
{t with 10 dfs, and t with 20 dfs} or
{chisq with 5 dfs, and F with 3,97 dfs} ?
A: Translate the test statistics to their p-values.
p-values have the virtue of being on a probability scale.
'pval <= 0.05' means the same for t-stats, F-stats, chisq-stats,...
==> Instead of max-|t|, use min-pvalue.
----------------------------------------------------------
| The minimum-p-value aproach to simultaneous inference |
----------------------------------------------------------
## - Application to the sequential testing problem, just ONE predictor:
Nsim Try n= all multiples of 10
x The majority of dips below 0.05 occur for small samples.
# This makes sense: small samples (n=10,20,...) have greater sampling variability
# than large samples (n=1000, 990, 980, ...),
# hence false rejections are more likely for small n.
#
# How many p-values below 0.05 per dataset?
tab 0 dip 0.05 in 6247 datasets out of 9999, 1 dip in 845 datasets, but also 98 dips in one dataset
tab0 To achieve FWER = 0.05 across the sequential tests from n=10 to n=N=1000 in steps of 10,
# each of the tests should be performed at per-test significance level
# of about 0.0044.
# Bonferroni would be too conservative due to the strong dependence between successive tests:
# requires alpha = 0.05/length(Ns) = 0.05/100 = 0.0005.
# The effective Bonferroni divisor should be:
0.05 / quantile(pval.min, probs=0.05)
# or about 11, not 100.
#
# INTERESTING QUESTION: Can we think of 11 as the appropriate degrees of freedom for
# sequential data acquisition carried out with the above specifications?
. Messages from the exercise of sequential testing:
o The min-pvalue method of FWER-adjustment works always,
never mind how heterogeneous your test statistics are.
The beauty of the p-value scale is that it is a uniform
probability scale across which different tests can be compared.
o Reminder, general fact:
---------------------------------------------------------
| If the random variable X has a continuous cdf F, |
| then both F(X) and 1-F(X) have U[0,1] distributions. |
---------------------------------------------------------
Therefore p-values are U[0,1] distributed under their H0s.
This is what the p-value does: It passes the null distribution
of the test statistic through the complement cdf 1-F.
So if H0 is true, then the p-value should be U[0,1] distributed.
[There is a messy detail for 2-sided tests: For example,
|t| has a folded t-distribution on [0,Inf).
But the principle still applies: 1-F(t) = P[|T|>t),
where F(t) = P[|T|<=t] is the cdf of |T|.
hence: p-value = 1-F(|T|) ~ U[0,1]
]
o Remember this:
------------------------------------------------------
| If you select the smallest among several p-values, |
| it is no longer a p-value. |
------------------------------------------------------
o The sequential procedure that consists of collecting more and
more data till one achieves significance is an example of
what Simmons, Nelson and Simonsohn call
----------------------
| 'p-value hacking' |
----------------------
(Simmons and Simonsohn are in our OID dept; Nelson is at UCB.)
o SN&S write that FWER-adjustment for p-value hacking is not feasible.
They probably have Bonferroni in mind. In fact, it would be silly
to adjust the above exercise with Bonferroni: 0.05/100 = 0.0005.
Why? Because successive tests are VERY HIGHLY CORRELATED.
However, we have shown with the above simulation that indeed
------------------------------------------------------
| FWER-ADJUSTMENT FOR P-VALUE HACKING IS POSSIBLE. |
------------------------------------------------------
o S&S developed a diagnostic for p-value hacking:
P-value hackers are not really fishing for the smallest p-value,
they just wait and juggle the data till they see p-values <= 0.05.
This should create a preponderance of p-values just below 0.05,
which is exactly what we saw in the histogram of dipping p-values:
hist(pval.05.vals, breaks=100, col="gray")
The S&S diagnostic essentially histograms the reported p-values <= 0.05
and if they cluster just below 0.05, S&S say there is evidence of
p-value hacking.
- ADJUSTMENT OF OBSERVED P-VALUES FOR SIMULTANEITY:
. We can ask how the raw p-values need to be adjusted to
reflect the fact that they are one among several p-values,
each of which could have been small due to chance because
the minimum of several p-values is <= than each.
. Q: Using the Bonferroni rule, how would you adjust p-values?
A: multiply by the p-values by the number of tests, but cap at 1.
Example: Bonferroni p-value adjustment for the 14 regression coeffs
in the Boston Housing data
pvals.raw.theory |t|)"]
pvals.adj.bonf |t|)"]
Nsim |t|)"]
)
} ); cat("\n")
# [More concise code would use replicate() instead of sapply(), but it can't do intermediate printing.]
#
# Here the actual adjustments: Count the fraction of simulated min-pvalues below observed p-values:
pvals.adj.theory Ratios of 14 mean that Bonferroni is fine.
# When the adjusted p-value is 1, the ratio is meaningless.
# For this dataset simulation-based adjustment produces
# relatively little benefit over Bonferroni adjustment:
# Simulation adjustment gives benefits only in 5 cases,
# and even for them the benefits is no betters than 77%
# of Bonferroni.
. Sometimes there is no theoretical p-value available.
This can happen when the test statistics you want to use
have no known null distribution.
However, you are learning the null distribution approximately
as you perform a null simulation. Shouldn''t you be able to
perform simultaneous inference based on the min-pvalue approach?
In fact you can. Here is how:
Collect the values of all test statistics in a matrix!
This can be memory-intensive, but it this is about the principle.
We do this exercise again with plain regression on the Boston Housing data,
even though here we have theoretical p-values.
The point is that we can compare simulated with theoretical raw p-values:
#
Nsim = stats.obs[v])) / (1+ nrow(stats.null)) )
# [A more efficient way of computing the same is using matrix computations, but who cares for this size:]
pvals.raw.sim.2 = stats.obs)) / (1+nrow(stats.null))
all(pvals.raw.sim == pvals.raw.sim.2)
#
# These p-values should be compared with 'pvals.raw.theory':
round(cbind(pvals.raw.theory, pvals.raw.sim,
simulation.error=2*sqrt(pvals.raw.sim*(1-pvals.raw.sim)/(1+Nsim)) # sd of proportion
), 4)
## pvals.raw.sim pvals.raw.theory simulation.error
## (Intercept) 0.0001 0.0000 0.0002
## CRIM 0.0010 0.0011 0.0006
## ZN 0.0006 0.0008 0.0005
## INDUS 0.7313 0.7383 0.0089
## CHAS 0.0024 0.0019 0.0010
## NOX 0.0001 0.0000 0.0002
## RM 0.0001 0.0000 0.0002
## AGE 0.9602 0.9582 0.0039
## DIS 0.0001 0.0000 0.0002
## RAD 0.0001 0.0000 0.0002
## TAX 0.0007 0.0011 0.0005
## PTRATIO 0.0001 0.0000 0.0002
## B 0.0006 0.0006 0.0005
## LSTAT 0.0001 0.0000 0.0002
# The agreement between the two columns is within the expected range.
Here is how you adjust these p-values for simultaneity:
First make these statistic comparable by running them through their CDFs.
These have to be estimated, which can be done with the rank transformation,
We estimate 1-CDF basically by (N+1-rank)/N = 1-(rank-1)/N
which is pretty much the empirical CDF based on simulation.
By convention, we do this on the simulated and the observed statistics
combined, the idea being that under the null the observed is just
like one of the simulated:
Under the null, each rank is equally likely for the observed.
[We will place the observed at the top of the matrix
#
# Combine obs and null stats and rank the columns:
stats.ranks = 1/(Nsim+1) = 1/nrow(stats.ranks)
# Look at a few rows:
stats.pvals[1:5,]
# Now for simultaneity adjustment:
# You have to ask what the minimum p-value is in each row:
stats.pvals.min Considerable agreement between the two adjusted pvalues.
# The adjustment by simulation has a higher bottom because of ties at
# the minimum p-value (~ maximal rank).
# Intuitively: The maximal rank is taken on in 8 rows,
# so the minimum adjusted p-values is 8/(Nsim+1).
# In this example there was of course no need to estimate the CDF
# with the rank transformation. However, because a theoretical
# CDF is available, we can compare with the theoretical adjustment
# (which also relies on simulation but not for CDF estimation).
- PS: Bikram''s remark:
There exists theory and methodology for sequential testing.
. Sequential Neyman-Pearson lemma (simple vs simple hypotheses)
. A. Wald''s sequential testing based on likelihodd ratio tests
- PS: Why the silly Nsim values such as 9999?
Reason: The observed is treated as one of the null values
under the null hypothesis.
==> We want Nsim+1 to be round, not Nsim.
THINK ABOUT THE FOLLOWING IDEA!
-----------------------------------------------------------------
| How would you modify the min-pvalue approach |
| if you DON''t want to control the FWER, i.e., |
| P[ number of false rejections > 0 ] <= alpha |
| but the following criterion instead: |
| P[ number of false rejections > 1 ] <= alpha |
| I.e., you''re ok with one false rejection but not two. |
-----------------------------------------------------------------
Hint: Modify the min-pvalue statistic.
Check papers by Romano (@ Stanford), or Lehmann & Romano (book)
#====================================================
- Scheffe simultaneous inference:
. Sometimes we are interested not just in inference
for betaj''s, but also for linear combinations thereof.
Consider a categorical variable with 3 groups:
base group A and other groups B and C.
One may be interested in any kind of difference
among the groups, individually and after arbitrary
pooling. Examples:
B-A = betaB
C-A = betaC
C-B = betaC - betaB
(B+C)/2 - A = 1/2*betaB + 1/2*betaC
(A+C)/2 - B = 1/2*betaC - betaB
B*2/3 + A*1/3 - C = 2/3*betaC - betaB
...
So every linear combination c1*betaB + c2*betaC
represents a generalized difference among the
groups A, B, C:
(-c1-c2)*A + c1*B + c2*C = c1*betaB + c2*betaC (*)
[Coefficient vectors such as (-c1-c2,c1,c2) that sum up
to zero are called "contrast vectors", and the
linear forms (*) they generate are called "contrasts".
If A is the reference group, then the coefficients reduce
to arbitrary vectors (c1,c2) forming a linear space.
One would like to search through arbitrary meaningful
contrasts and assess their statistical significance
with RIs and CIs.
Note that the multiple regression may include other
predictors than this one categorical variable (i.e.,
the two dummy predictors the three groups require),
hence we are interested in linear forms of the betas with
coefficient vectors that form a subspace of R^{p+1}.
- Summary statement of the purpose of Scheffe simultaneous inference:
. Consider linear forms of regression coefficients:
c0*beta0 + c1*beta1 + ... cp*betap (**)
for (c0,c1,...,cp) in a linear subspace of R^{p+1}.
Scheffe provides simultaneous RIs and CIs for all such
derived parameters.
. Scheffe provides simultaneously valid inference by
providing t-statistics for all derived parameters (**)
and performing the max-t approach for these uncountably
many t-statistics.
. This sounds scarier than it is, but the max-t distribution
turns out to be a root-F distribution, and the F derives
from the F-test of the assumption that all (**) are zero,
as we shall see now.
- Technical execution of Scheffe simultaneous inference:
. Let V be a linear subspace of R^{p+1}.
Denote its elements by c = (c0,c1,...,cp)^T
. In the above example, if there are predictors other
than the categorical variable, then the vectors 'c'
would be zero other than for betaB and betaC:
V = { (0,0,...,0,cB,cC,0,...0)^T | cB,cC in R }
. Translate the estimate 'c^T b' of the parameter 'c^T beta'
to linear forms of y rather then b. Because each bj is
a linear form of y through the adjusted predictors formula,
the linear form c^T b = c0*b0 + c1*b1 + ... + cp*bp will
also be a linear form of y. This time it is easier to
go through the triple-x formula, though:
b = (X^T X)^{-1} X^T y
c^T b = c^T (X^T X)^{-1} X^T y
==> C := X (X^T X)^{-1} c
c^T b = C^T y
C contains the coefficients of the linear form of y
that produces the estimate c^T b of c^T beta.
W := { C = X (X^T X)^{-1} c | c in V }
is also a linear space, but it is a subspace of R^N,
in fact, a subspace of cspan(X).
. What is the t-statistic for testing H0: c^T beta = 0 ??
Answer:
t_C = < y, C/|C| > / s
. How can we achieve simultaneity of inferences for all C in W?
Answer: max-t over all C in W
max_{C in W\0} |t_C| = ???
This problem can be solved explicity through geometry.
Note that we are looking for the unit vector C/|C| in W
with the largest cosine, i.e., smallest angle, to y.
==> C ~ orthogonal projection of y onto W
Let H_W be the orthogonal projection onto W.
max_{C in W\0} |t_C| = max_{C in W\0} || / s
= || / s
= / s
= |H_W y| / s
NOTE: The null distribution of
(|H_W y|^2 / dim(W)) / s^2 ~ F(dim(W),N-p-1)
|H_W y|^2 / s^2 ~ F(dim(W),N-p-1) * dim(W)
==> |H_W y| / s ~ sqrt(F(dim(W),N-p-1) * dim(W))
==> Use 1-alpha quantiles as multipliers of standard error of c^T b.
Keep in mind we are working under a null assumption: y = eps ~ N(0,sig^2 I)
Standard error of c^T b:
V[c^T b] = c^T V[b] c = c^T (X^T X)^{-1} c * sigma^2
CI for c^T beta:
----------------------------------------------------------
| [ c^T b +- q * sqrt(c^T (X^T X)^{-1} c) * s ] |
| |
| where q = sqrt( qf(1-alpha, dim(W), N-p-1) * dim(W) ) |
----------------------------------------------------------
Recall: dim(W) = dim(V) = dim of space of contrasts
Reason: We assume X has full rank, else (X^T X)^{-1} would not exist.
Gambling guarantee: We don''t know ever whether 'c^T beta in [ ... ]'
is true for our dataset,
but it is the case for a fraction 1-alpha of datasets.
P[ c^T beta in [ ... ] forall c in V | beta is true ] = 1-alpha
^ ^^^^^^^^^^^^^
Fraction of datasets Simultaneity over contrasts with coefficient vectors c in V.
- Comparison of interval widths:
. Graphical illustration:
# Residual dfs: Arbitrary choices, here for the Boston Housing data
df.res Adjustment with Bonferroni and for exact independence are very close.
Scheffe intervals are the widest because they provide the strongest guarantees.
. Asymptotic comparison of interval widths:
+ Independent max-|z|: the multiplier of standard error
grows with the number 'ntest' of comparisons like
ntest Remove the intercept by centering the cols of X and y.
Write the new location coordinates again as xx but xx = (x1,...,xp)^T.
Now the xx form a linear space of dimension p.
Apply Scheffe with dimension dim(W)=p.
One more hitch: We also need to assume that the predictors are
interval scale (making the whole real line accessible to them).
For ratio scale or prob scale or dummies or counting scales,
we are not obtaining a space of 'coefficients' formed by the
location vectors xx.
More details: xx^T betahat = xx^T (X^T X)^{-1} X^T y
C(xx) = X (X^T X)^{-1} xx
We really only need its normalized version to form
t(xx) = < C(xx)/|C(xx)|, eps > / sigmahat
Forming the max-|t| we have
max_xx |t(xx)| = max_xx |< C(xx)/|C(xx)|, eps >| / sigmahat
What kind of set is formed by
S := { C(xx)/|C(xx) : xx = (x1,x2,..,xp) } ?
It is a subset of the unit sphere in R^N,
but its intrinsic dimensionality is no more than p.
(After empirically centering the columns of the design matrix, the
origin of xx-space corresponds to the p-tuple of empirical means.)
Fact: If the potential values of xx-space contain an open neighborhood
of the origin, then the set S is a (p-1)-sphere in a p-dim
subspace of N-space.
==> All continuous scales (interval, ratio, probability/freq)
result in full Scheffe inference if simultaneity is desired
for ALL POSSIBLE LEGITIMATE locations in predictor space.
==> Discrete scales (dummy, counting) result in a set S that
consists of finitely or countably many points that are
not dense in the (p-1)-sphere in p-dim subspace in N-space.
This case does not require full Scheffe inference.
Ex: 5 dummy variables form a set of 32 possible combinations
of 0 and 1 values, hence their set S consists of just
32 points.
#====================================================
- Implication of FWER control: "STRONG ERROR CONTROL" (often, not always)
Assume you performed a large number p of t-tests: t1, t2, ..., tp.
Assume you used a threshold K for simultaneous inference based on max-|t|:
P[ max |tj| > K | beta1 = beta2 = betap = 0 ] = alpha (*)
^^^^^^^^^^^^^^^^^^^^^^^^^
Global null hypothesis
(excludes testing the intercept beta0)
Problem:
-----------------------------------------------------
| In actual data analysis, some of the tj''s will |
| have resulted in rejections, others not. |
| |
| What can we say about the rejections? |
| Will they be true alts with high probability? |
| |
| Conversely, what can we say about retentions? |
| Will they be true null with high probability? |
| |
| If we use conservative statistical inference |
| calibrated under a global null hypothesis, |
| what does this imply for a mix of nulls and alts? |
-----------------------------------------------------
FRAMEWORK: Illustrated with regression, but works for
general multiple testing situations.
INDEX SETS for parameters/regression coefficients being tested:
H00 := { 1, 2, ..., p } indeces of the GLOBAL null (UNREALISTIC)
H0 := { j : betaj == 0, 1 <= j <= p } indeces of TRUE NULLS (UNKNOWN !!!!!!!!!!!!)
H1 := { j : betaj != 0, 1 <= j <= p } indeces of TRUE ALTS (UNKNOWN !!!!!!!!!!!!)
FWER control:
---------------------------------------------------------------------
| sup_{P in H00.dist} P[ ( max_{j in H00} |tj| ) > K ] <= alpha |
---------------------------------------------------------------------
Note: H00 = H0 U H1 (union of sets)
The following we do know from data -- they are the results of our testing:
H0.hat := { j : |tj| <= K } Retentions
H1.hat := { j : |tj| > K } Rejections
Both are RANDOM INDEX SETS !!!!!!!!!!
They differ from dataset to dataset.
MIND GAME:
We do not know the true state of nature, P(dy) = N(X beta, sig^2 I) and
the true H0 and H1 obtained from the true beta = (beta0,beta1,...,betap)^T.
But we hope that H0.hat and H1.hat can inform us about H0 and H1
with high probability (i.e., for most datasets).
----------- THEOREM: STRONG ERROR CONTROL --------------
| |
| . All rejections are true alts with high probability: |
| |
| P[ H1.hat subset of H1 ] >= 1-alpha (1) |
| |
| |
| . All true nulls are retained with high probability: |
| |
| P[ H0 subset of H0.hat ] >= 1-alpha (2) |
| |
--------------------------------------------------------
FACT: (1) <==> (2)
The equivalence follows from simple set theory:
A subset of B <==> B^C subset of A^C (complements)
NOTE: These results are all we can hope for in this generality.
- There is no way to find a guarantee of the following kind:
P[ H1 subset of H1.hat ] >= 1-alpha (1) WRONG, WRONG, WRONG !!!!
That is, a guarantee that our rejections contain ALL true alternatives.
Q: Why is this impossible?
A: Alternatives can be very weak, e.g., betaj close to 0 but still != 0.
Realistic sample sizes might be insufficient to reach rejection for such betaj.
[Recall that standard errors converge to zero but slowly, ~ 1/sqrt(N).]
- Equivalently, there is no way to find a guarantee such as
P[ H0.hat subset of H0 ] >= 1-alpha (2) WRONG, WRONG, WRONG !!!!
That is, a guarantee that ALL our retentions are true nulls.
Q: Why is this impossible?
A: Some retentions may arise from weak alternatives, betaj ~ 0 but != 0,
whose rejection is unattainable with realistic sample sizes.
- In practice we can do power calculations for |betaj| > eps > 0.
I.e., we decide what kind of 'effect size' |betaj| > eps is worth looking for.
We can then calculate a minimal sample size to have a prescribed power
(= prob of rejection under a worthwhile alternative, |betaj| > eps > 0).
FOR THE PROOF OF STRONG ERROR CONTROL, DEFINE SETS OF DISTRIBUTIONS:
Full Model = { P(dy) = N(X beta, sig^2 I): beta in Reals^{p+1}, sig > 0 }
H0.dist = { P(dy) = N(X beta, sig^2 I): betaj == 0 for j in H0, sig > 0 }
H00.dist = { P(dy) = N(X*beta, sig^2 I): betaj == 0 for j in H00, sig > 0 }
-----------------------------------------------------
IMPORTANT: | H0 subset H00, but H0.dist superset of H00.dist | !!!!!!!!!
-----------------------------------------------------
^^^^^^ ^^^^^^^^
H0 constrains fewer parameters than H00, hence
H0.dist has more distributions than H00.dist.
In fact: H00.dist = { N(x0*beta0, sig^2 I): beta0 in Reals, sig > 0 }
NOTE: H0.P contains distributions with betaj == 0 for j in H1. That''s ok.
We actually don''t need H0 to be the exact set of true nulls.
H0 can be any subset of the true nulls.
PROOF: inf_{P in H0.dist} P[ H0 subset of H0.hat]
= inf_{P in H0.dist} P[ ( max_{j in H0} |tj| ) <= K ] \ ????????????????????
?= inf_{P in H00.dist} P[ ( max_{j in H0} |tj| ) <= K ] / ????????????????????
>= inf_{P in H00.dist} P[ ( max_{j in H00} |tj| ) <= K ] from FWER control over H00
>= 1 - alpha
SUBTLETY:
. To make the proof work, we need an argument for '?=' --> '>=' .
. However, at that stage we have the opposite inequality: '?=' is '<=' .
Reason: Because H0.dist is a superset of H00.dist,
inf over H0.dist is <= inf over H00.dist,
which is the wrong direction from what we need.
. Yet if suitable PIVOTALITY holds, we get '==' !!!
Here is how: The vector of null t-statistics, i.e,
{ tj: j in H0 }
has a joint distribution that is independent of the
nuisance values betaj for j in H1.
Reason: { xj.adj: j in H0 } _|_ { xk: k in H1 }
==> numerator distributions of tj do not depend on
sum_{k in H1} xk betak
This nuisance part of the model is orthogonalized out
by xj.adj for j in H0 in the numerators of tj.
==> For j in H1, we can set betaj==0, and
the null distribution of { tj: j in H0 } is the same.
==> '?=' above is not only '<=' but '=='.
. While there is no deep math, there is subtlety in the proof.
. Counter example where pivotality is violated: correlations !
T12 = corr(X1,X2); T13 = corr(X1,X3); T23 = corr(X2,X3)
These are sample correlations used as test statistics.
Null hypothesis: C(X1,X2) = 0 and C(X1,X3) = 0
Global '' : C(X1,X2) = 0 and C(X1,X3) = 0 and C(X2,X3) = 0
[C(.,.) is the true population correlation.]
Why is the joint distribution between T12 and T13 affected by
whether C(X2,X3) = 0 is true or not?
Think of X1 as the response.
Then T12 and T13 are actually equivalent to t-statistics for
simple regressions of X1 on X2 and X1 and X3, respectively.
Now imagine that X2 and X3 are highly collinear, i.e., their
joint distribution is VERY far from the null hypothesis C(X2,X3) = 0.
T12 and T13 will then be VERY highly correlated with each other.
Of course they are correlated even when C(X2,X3) = 0, but less so.
At any rate, the joint distribution of T12 and T13 is affected
by whether the larger null hypothesis is true or not.
- Challenge question:
Can one generalize the strong error control result
from max-|t| to max2-|t|, where
max2-|t| is the second largest |tj|?
The desired result would be that whatever the set of true nulls is,
we fail to find at most one true null:
P[ | H0 \ H0.hat | <= 1 ] >= 1-alpha (2 - Miss-1 Strong Error Control)
P[ H0 subset of H0.hat ] >= 1-alpha (2 - Original Strong Error Control)
Conversely:
P[ | H1.hat \ H1 | <= 1 ] >= 1-alpha (1 - Miss-1 Strong Error Control)
P[ H1.hat subset of H1 ] >= 1-alpha (1 - Original Strong Error Control)
- Next would be: FDR -- False Discovery Rate
==> An extremely important concept that provides
narrower intervals with a different guarantee.
But it works best under the assumption of independent tests.
================================================================