Linear regression with a factor, using R

Table of Contents

Overview

Fitting models in R is simple and can be easily automated, to allow
many different model types to be explored. This tutorial shows how to
fit a variety of different linear regression models to continuous data
from different categories. This shows the R formula interface and also
demonstrates the power and flexibility of the plyr and ggplot2
packages for manipulating and visualising data, respectively.

Create and plot data

It is best to assemble a data frame of x and y data, to keep these
two vectors associated in a single object; subsequent fitting &
plotting of the data can then reference this data frame. The use of
'within' here performs this task without leaving copies of x and
y as separate objects. This approach is recommended, to avoid
cluttering up the R workspace with unnecessary objects.

Now plot the data, using the lattice package, which makes it easy
to display the separate categories within the data. Note that
lattice is a 'recommended' package, which means that it comes
bundled with the standard installation of R, but is not automatically
loaded by default, so you need to do so using the library function.

The legend is automatically created to match the point colours used to
identify the different categories in the class variable.

A Lattice plot of the data to be fitted, showing the 5 separate categories

You can clearly see that each category shows a roughly linear
relationship between x and y, with a different slope and
intercept.

Specify & fit linear models

Now let's specify a variety of different linear models to fit to the
data, using the formula interface in R. We want to model y in terms
of x and possibly also class, so the syntax starts with "y ~".
Formulae can be treated as normal objects in R, so you can generate
them by manipulating character strings, allowing us to avoid code
duplication by pasting this common initial part onto the different
other parts:

To fit the models to the data, we use the excellent 'plyr' package,
which makes it very easy to apply a similar action repeatedly to
subsets of data and combine the results together. In this case, we
want to run lm to fit a linear model to mydata, using a different
formula in each case. The 'llply' function reads in the formula
list, and runs lm for each entry on the mydata data frame. It then
outputs a list of lm models.

Extract model predictions & plot vs. raw data

Now we need to add in the model predictions for each x value and
also include the AIC relative measure of fit quality. This calculation
is performed using another plyr package function.

In this case, 'ldply' allows you to loop over each fitted model in
the models list and calculate some numbers. The results for each
model are added into the existing model data frame then collected
together and returned in a single data frame:

'predData' also contains a column called '.id', which gives the
name of the model corresponding to each row of the data frame, which
it gets from the names of the input list (i.e. 'models'). Model A is
the simplest case where class is not included, hence is reported as
a missing (factor) value ('<NA>') in the predicted data.

Next, add in the model formula, and create a customised label to be
used as a strip title for each panel in the plot. These strip titles
are ordered from best to worst fitting model, according to the AIC
value (lower is better).

Specifying 'colour = class' as an aesthetic attribute treats each
different category of 'class' as a separate group of data, with a
different colour for the points & lines used to represent it.

Fitted data + linear model prediction for 6 different models

Not surprisingly, the models with separate slopes and intercepts for
each category provide the best fit. However, this might well not be
the case if there was more noise on the data and/or smaller
differences between the slopes & intercepts in each category.

R source code

Click for the complete R source code for this tutorial. Note that this
is automatically generated ("tangled") from the org mode source file
for this document, which adds some extra commands to specify filenames
for plots (and to subsequently close the graphics device). Note that
if you use 'source' to read in the R code, the ggplot2 plots will
not be created as auto-printing is turned off when using 'source'
(see R FAQ 7.22 for more information).