Virtues of Design Perspective

Separation of selection and inference.

Garden of forking paths.

[M]odels become stochastic in an opaque way when their
selection is affected by human intervention based on post-hoc considerations such
as “in retrospect only one of these two variables should be in the model” or “it turns
out the predictive benefit of this variable is too weak to warrant the cost of collecting
it.” (Berk et al., 2013)

Seeds of a Design Perspective

Wasserman's HARNESS

Response to the Lockhart et al. lasso hypothesis-testing paper.

Randomly split data.

Model selection with one half.

Conditional on the selected model, standard inference on the other half.
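The three steps above can be sketched as follows. This is a minimal illustration, not Wasserman's actual HARNESS code; the simulated design, the forward-selection-by-BIC rule, and all variable names are assumptions made for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 200, 6
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.5, 0.0, 0.0, 0.0, 0.0])  # two true signals
y = X @ beta + rng.normal(size=n)

# 1. Randomly split the data in half.
idx = rng.permutation(n)
sel, inf = idx[: n // 2], idx[n // 2:]

# 2. Model selection on the first half: greedy forward selection by BIC.
def bic(Xs, ys):
    m = Xs.shape[0]
    resid = ys - Xs @ np.linalg.lstsq(Xs, ys, rcond=None)[0]
    return m * np.log(resid @ resid / m) + Xs.shape[1] * np.log(m)

chosen, remaining, best = [], list(range(p)), np.inf
while remaining:
    scores = [bic(X[sel][:, chosen + [j]], y[sel]) for j in remaining]
    if min(scores) >= best:
        break
    best = min(scores)
    j_best = remaining[int(np.argmin(scores))]
    chosen.append(j_best)
    remaining.remove(j_best)

# 3. Conditional on the selected model, standard OLS t-inference
#    on the untouched second half.
Xi, yi = X[inf][:, chosen], y[inf]
bhat, *_ = np.linalg.lstsq(Xi, yi, rcond=None)
df = len(inf) - len(chosen)
s2 = np.sum((yi - Xi @ bhat) ** 2) / df
se = np.sqrt(s2 * np.diag(np.linalg.inv(Xi.T @ Xi)))
tcrit = stats.t.ppf(0.975, df)
cis = [(b - tcrit * s, b + tcrit * s) for b, s in zip(bhat, se)]
print("selected:", chosen)
print("95% CIs:", cis)
```

Because selection never sees the second half, the intervals in step 3 have their nominal coverage conditional on the selected model.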

Seeds of a Design Perspective

Issues with Data Splitting

Some statisticians are uncomfortable with data-splitting. There are two common objections. The first is that the inferences are random: if we repeat the procedure we will get different answers. The second is that it is wasteful. (Wasserman, in response to Lockhart et al.)

(Finally) A Contribution

Principled Data Splitting

Can we use design principles to improve data-splitting techniques?

The split can be skewed, rather than uniformly random, to alleviate both concerns.

Optimization can be traded off against randomization to navigate the optimality/robustness tradeoff.

Key idea: inference is already conditional on \( X \), so “splitting on observables” can be used to improve power and to restrict randomization, without biasing inference.
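A minimal sketch of splitting on observables, assuming leverage is the observable we skew on (the leverage-based rule and all names here are illustrative, not a prescription from the source): because the split depends only on \( X \) and never on \( Y \), inference conditional on \( X \) is unaffected.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 4
X = rng.normal(size=(n, p))

# Leverage of each point under the full design (diagonal of the hat matrix).
H = X @ np.linalg.inv(X.T @ X) @ X.T
lev = np.diag(H)

# Skewed split: send the highest-leverage points to the inference set,
# the rest to the selection set. Uses only X, never y.
order = np.argsort(lev)
sel, inf = order[: n // 2], order[n // 2:]

print("mean leverage (selection set):", lev[sel].mean())
print("mean leverage (inference set):", lev[inf].mean())
```

Deterministic rules like this one remove the randomness objection entirely; partially randomized rules interpolate between this and a uniform split.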

A Small Result

Assume \( (Y_i, X_i) \) multivariate normal, as before.

Procedure:

Use a penalized log-likelihood information criterion (AIC, BIC, DIC, or another) to select a model on the selection set.

Compute predictive intervals for \( Y^{rep} \) using the selected model, fit on the inference set.

Lemma:
Under the multivariate normal model, for fixed split sizes in the model selection set \( n_1 \) and the inference set \( n_2 \), the optimal (oracle) splitting policy maximizes the leverage of the points in the inference set with respect to the selected model.

Because of multivariate normality, the residuals for any model \( A \) are mean-zero normal, so
\[ (n - p)\,\hat\sigma_A^2 / \sigma_A^2 \sim \chi^2_{n-p}, \]
and hence the expectation of the information criterion does not depend on \( X \).

Meanwhile, the predictive variance has the form
\[ \mathrm{Var}(\hat Y^{rep}) = \sigma^2_A\, X_A^{rep} (X_A^{\top} X_A)^{-1} X_A^{rep\top}, \]
whose trace is decreasing in the leverage of the inference set.
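The trace claim can be checked numerically. This is an illustrative sketch under an assumed simulated design, not the paper's code: it compares the trace of \( X^{rep} (X_A^\top X_A)^{-1} X^{rep\top} \) (the predictive variance up to \( \sigma^2_A \)) when the inference set is a random half versus the highest-leverage half.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.normal(size=(n, p))
Xrep = rng.normal(size=(20, p))  # replication points for prediction

# Leverage of each point under the full design.
lev = np.diag(X @ np.linalg.inv(X.T @ X) @ X.T)

def pred_trace(inf_idx):
    # Trace of the predictive covariance (up to sigma^2) when the model
    # used for prediction is fit on the inference set inf_idx.
    Xi = X[inf_idx]
    return np.trace(Xrep @ np.linalg.inv(Xi.T @ Xi) @ Xrep.T)

random_split = rng.permutation(n)[n // 2:]          # uniform half
skewed_split = np.argsort(lev)[n // 2:]             # highest-leverage half

print("random split trace:", pred_trace(random_split))
print("skewed split trace:", pred_trace(skewed_split))
```

Putting the high-leverage points in the inference set inflates \( X_A^\top X_A \), shrinks its inverse, and hence shrinks the predictive variance, consistent with the lemma.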