Typically, in studies that have large numbers of predictor variables available, many of those
variables may be unrelated to the response among individuals in the population of interest. The
problem then is to identify the important variables and fit a reduced model to predict responses.
This problem is referred to as variable selection and will be discussed here in the
context of prediction of new observations rather than fitting an existing data set. Variable
selection represents an attempt to find an optimal balance between precision and bias. As we have
seen, addition of a variable to a model always reduces the residual sum of squares unless the new
variable is an exact linear combination of the predictor variables already in the model or it has
0 correlation with the residuals of the current model. Even if a variable is generated randomly,
the probability of either of those happening is essentially 0. The effect of including weakly correlated
variables in a model is increased bias when the model is used to predict responses for new
observations not in the data set used for fitting. We refer to such situations as
over-fitting.

If the number of potential predictor variables is small, then models with each possible subset of
predictors could be fit and compared. Obviously the residual sum of squares or R-squared should not be
the basis for comparison of models, since criteria based on those measures would always select the
largest model. There are two basic approaches to this problem that are used most often: penalized
likelihood methods and shrinkage methods.

Penalized likelihood methods subtract from the maximized log-likelihood function a quantity that is a
function of the number of variables in the model. These likelihood penalties are designed to adjust
for the increase in bias that would occur if a noise variable is added to the model. One of the
earliest such methods is Mallows' Cp statistic, defined by

$$ C_p = \frac{RSS_p}{\hat{\sigma}^2} - n + 2p, $$

where $RSS_p$ is the residual sum of squares, $p$ is the number of variables in the model, and
$\hat{\sigma}^2$ is a low-bias estimate of residual error variance that does not depend on $p$. Typically,
the low-bias estimate of residual error variance is obtained from the largest possible model. (Note
that Mallows' definition used $p+1$, the number of estimated coefficients including the intercept,
instead of $p$.) Since $n$ is constant with respect to $p$, this is equivalent to

$$ \frac{RSS_p}{\hat{\sigma}^2} + 2p. \qquad (1) $$
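As a concrete illustration, a minimal R sketch of criterion (1) follows; the data frame dat, the response y, and the predictors x1 through x4 are hypothetical.

# Hypothetical data frame dat with response y and candidate predictors x1, ..., x4
full <- lm(y ~ x1 + x2 + x3 + x4, data = dat)   # largest possible model
sigma2.hat <- summary(full)$sigma^2             # low-bias estimate of error variance

# Criterion (1) for a candidate submodel
cp <- function(fit) {
  rss <- sum(resid(fit)^2)                      # residual sum of squares
  p   <- length(coef(fit)) - 1                  # number of predictors, excluding intercept
  rss / sigma2.hat + 2 * p
}

cp(lm(y ~ x1 + x2, data = dat))                 # compare submodels by their Cp values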

An information-based penalty was developed by Akaike and is referred to as Akaike's Information
Criterion (AIC). A similar measure introduced by Schwarz is referred to as the Bayes Information
Criterion (BIC). For linear regression these are defined by

$$ AIC = n\log(RSS_p/n) + 2p, \qquad BIC = n\log(RSS_p/n) + p\log(n). $$

Addition of a variable decreases RSS but increases the penalty. The best model is the one with the
smallest value of the criterion. Note that the dimension penalties in these criteria do not include
precision associated with the model being evaluated, nor do they include model bias.
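For comparison in R, the built-in AIC() and BIC() functions compute these criteria up to additive constants that do not affect the ranking of models; the two candidate models below are hypothetical.

# Two hypothetical candidate models for a response y in a data frame dat
fit1 <- lm(y ~ x1 + x2, data = dat)
fit2 <- lm(y ~ x1 + x2 + x3, data = dat)

# Smaller values are preferred; BIC applies the heavier log(n) penalty
AIC(fit1); AIC(fit2)
BIC(fit1); BIC(fit2)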

When there are more than a few potential predictor variables, it is most efficient to use a forward
stepwise approach to the selection of variables. The variable most strongly correlated with the
response is selected initially and a linear model is fit. At each step the next variable selected
from the remaining variables is the one most strongly correlated with the residuals of the current
fit. This is continued until all variables have been added to the model or a predefined stopping
criterion has been satisfied. The selection criterion (Cp, AIC, or BIC) is evaluated at each step
and the model selected is the one with minimum value of the criterion.
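The following is a minimal sketch of that search, assuming a numeric response vector y and a predictor matrix X with column names; a production implementation would stop at a criterion-based rule rather than adding every variable.

# Forward selection by correlation with the residuals of the current fit
forward.search <- function(y, X) {
  remaining <- colnames(X)
  selected  <- character(0)
  res  <- y - mean(y)                    # residuals of the intercept-only model
  crit <- AIC(lm(y ~ 1))                 # criterion value at each step
  while (length(remaining) > 0) {
    r <- sapply(remaining, function(v) abs(cor(res, X[, v])))
    best      <- remaining[which.max(r)] # variable most correlated with current residuals
    selected  <- c(selected, best)
    remaining <- setdiff(remaining, best)
    fit  <- lm(y ~ X[, selected, drop = FALSE])
    res  <- resid(fit)
    crit <- c(crit, AIC(fit))
  }
  list(order = selected, AIC = crit)     # select the step with the smallest criterion
}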

This process is performed in R with the step() function. With a scope argument giving the largest
model to consider and direction="forward", this function performs forward stepwise regression using
AIC as the selection criterion. Steps are terminated when adding any of the remaining variables
would increase AIC relative to the current model. The BIC criterion is chosen with the argument
k=log(n), where n is the sample size. This function is implemented by updating the QR
decomposition and so has computational complexity of the same order of magnitude as that of
obtaining the QR decomposition using all of the predictors.
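A typical forward call might look like the sketch below, where the data frame dat with response y is hypothetical.

# Hypothetical data frame dat containing the response y and the candidate predictors
null.fit <- lm(y ~ 1, data = dat)   # starting point: intercept-only model
full.fit <- lm(y ~ ., data = dat)   # largest model under consideration

# Forward stepwise selection with the AIC criterion (default penalty k = 2)
fwd.aic <- step(null.fit, scope = formula(full.fit), direction = "forward")

# The same search using the BIC criterion
fwd.bic <- step(null.fit, scope = formula(full.fit), direction = "forward",
                k = log(nrow(dat)))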

Shrinkage methods subtract from the log-likelihood function a penalty that is proportional to a norm
of the coefficients. Under standard linear model assumptions, maximizing the likelihood is
equivalent to minimizing the sum of squared residuals, and so the goal is to minimize

$$ \sum_{i=1}^{n} \left(y_i - x_i^T\beta\right)^2 + c\,\|\beta\|, $$
where c>0 is a tuning parameter. The idea here is that larger models would have larger
values for the norm of the coefficients, so the reduction in RSS associated with a larger model
would need to be high enough to offset the increased norm of its coefficients. Coefficients
that are essentially 0 in these methods are removed from the model. If the 2-norm is used here,
the method is referred to as ridge regression, but this does not ordinarily result in
reduction of the number of variables. Use of the 1-norm almost always results in removal of weak
variables. This method is referred to as the lasso.
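When the penalty is taken to be the squared 2-norm, as is conventional for ridge regression, the minimizer has a closed form; the sketch below assumes a centered and scaled predictor matrix X, a centered response y, and a hypothetical tuning value c.

# Ridge coefficients: (X'X + cI)^{-1} X'y for a given tuning constant c > 0
ridge.coef <- function(X, y, c) {
  p <- ncol(X)
  solve(crossprod(X) + c * diag(p), crossprod(X, y))
}

# Coefficients shrink toward 0 as c increases but are rarely exactly 0,
# which is why ridge regression does not by itself remove variables.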

Recall that the reduction in RSS when a variable a is added to a model with design matrix X is
given by

$$ \frac{(d^T Y)^2}{d^T d}, $$

where

$$ d = \left(I - X(X^T X)^{-1} X^T\right) a. $$

Note that d is the projection of a onto the orthogonal complement of
range(X). This projection removes from a all of its correlation with the variables already in the
model, so that the correlation between d and Y reflects the partial correlation between a and Y
given those variables. In practice removal of all of those partial correlations may cause the
stepwise process to follow a sub-optimal path. An alternative algorithm can be defined by taking
only a very small step in the direction of the projection onto the orthogonal complement of
range(X). This algorithm is referred to as stagewise regression.
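A bare-bones version of the stagewise idea, assuming standardized predictor columns in X, a centered response y, and an arbitrary small step size, is sketched below.

# Forward stagewise regression with a fixed small step size eps
stagewise <- function(X, y, eps = 0.01, iters = 5000) {
  beta <- rep(0, ncol(X))
  res  <- y
  for (i in seq_len(iters)) {
    cors <- drop(crossprod(X, res))              # inner product of each column with residuals
    j    <- which.max(abs(cors))                 # predictor most correlated with residuals
    beta[j] <- beta[j] + eps * sign(cors[j])     # very small step in that direction
    res     <- res - eps * sign(cors[j]) * X[, j]
  }
  beta
}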
An efficient implementation of forward stagewise regression, referred to as Least Angle
Regression (LARS), is available in the contributed package lars. The authors of that
package show that LARS is closely related to lasso variable selection. This package includes
options for forward stagewise, LARS, and lasso fits, and Mallows' Cp statistic is used
for variable selection. The lars package is not distributed with R; it must be
downloaded and installed from CRAN, for example via the Package Installer menu item for
R or with install.packages().
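A typical use of the package might look like the following sketch; the predictor matrix x and response vector y are hypothetical, and the step passed to coef() is chosen only for illustration.

# install.packages("lars")          # one-time installation from CRAN
library(lars)

# Hypothetical predictor matrix x and response vector y
fit <- lars(x, y, type = "lasso")   # type = "lar" or "forward.stagewise" also available
summary(fit)                        # degrees of freedom, RSS, and Mallows' Cp at each step
plot(fit)                           # coefficient paths

coef(fit, s = 4, mode = "step")     # coefficients at an illustrative step of the path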

Big Data problems contain large numbers of potential predictor variables and so variable selection
is an integral step in the analysis of such data. For example, identification of genetic biomarkers
with genomic data may lead to new drugs and treatments as well as better understanding of disease
mechanisms. However, genomic data sets often contain tens of thousands of variables in the form of
gene expressions. Further exacerbating that problem are the much smaller sample sizes typically used
for such studies. Classical methods for variable selection almost always lead to over-fitting the
data by selecting variables that may appear useful for the sample in the study but that only add
bias to predictions of responses for new individuals from the population. For these reasons,
variable selection for Big Data remains an important topic for future research.