Julia Pilowsky

Guide to Mixed Models in R

A Practical Guide to Mixed Models in R

Preface

I created this guide so that students can learn about important statistical concepts while remaining firmly grounded in the programming required to use statistical tests on real data. I want this to be a guide students can keep open in one window while running R in another window, because it is directly relevant to their work.

In that spirit of openness and relevance, note that I created this guide in R v 3.1.0 and used the following packages:

car v 2.0

MASS v 7.3

lme4 v 1.1

mlmRev v 1.0

agridat v 1.8

MCMCglmm v 2.19

ggplot2 v 0.9.3.1

scapeMCMC v 1.1

1. Is a mixed model right for your needs?

A mixed model is similar in many ways to a linear model. It estimates the effects of one or more explanatory variables on a response variable. The output of a mixed model will give you a list of explanatory values, estimates and confidence intervals of their effect sizes, p-values for each effect, and at least one measure of how well the model fits. You should use a mixed model instead of a simple linear model when you have a variable that describes your data sample as a subset of the data you could have collected.

What do I mean by that? Let's take a look at some preliminary data from a paper wasp kin recognition project I'm working on.

My response variables of interest are Aggression and Tolerance. Aggression is the number of aggressive behaviors in a sixty minute period. Tolerance is the number of tolerant behaviors in a sixty minute period. I am interested in the effects of relation (whether the wasps came from the same or different colonies) and season (early or late in the colony cycle) on these response variables. These effects are "fixed" because no matter where, how, or how many wasps I had sampled, I would have still have had the same levels in the same variables: same colony vs. different colony, and early season vs. late season.

There are two other variables, though, that would not remain fixed between samples. Observer represents the person who scored the interaction and Test.ID represents the group of three wasps observed for sixty minutes. The levels of Observer would be different if I had sampled in a different year, because different undergraduate volunteers would be available to observe behavior. The levels of Test ID would also vary between samples, because I could always rearrange which wasps participate in each experimental trial. Each trial is a unique sub-sample of the wasps I collected at that time. If I had been able to test the wasps individually, and if all observers had scored all interactions, I wouldn't have any random effects. But instead, my data are inherently "lumpy," and the random effects describe that lumpiness. Mixed models allow us to account for the lumpiness of data.

Before you proceed, you will also want to think about the structure of your random effects. Are your random effects nested or crossed? In the case of my study, the random effects are nested, because each observer recorded a certain number of trials, and no two observers recorded the same trial, so here Test.ID is nested within Observer. But say I had collected wasps that clustered into five different genetic lineages. The 'genetics' random effect would have nothing to do with observer or arena; it would be orthogonal to these other two random effects. Therefore this random effect would be crossed to the others.

2. What probability distribution best fits your data?

Say you've decided you want to run a mixed model. The next thing you have to do is find a probability distribution that best fits your data. There are many ways to test this, but I'll show you one. This method requires the packages 'car' and 'MASS'. Note that the negative binomial and gamma distributions can only handle positive numbers, and the Poisson distribution can only handle positive whole numbers. The binomial and Poisson distributions are different from the others because they are discrete rather than continuous, which means they quantify distinct, countable events or the probability of these events. Now let's find a fitting distribution for my Aggression variable.

require(car)

## Loading required package: car

require(MASS)# This is so that distributions that must be non-zero can make sense of my# datarecog$Aggression.t<-recog$Aggression+1qqp(recog$Aggression.t,"norm")

# lnorm means lognormalqqp(recog$Aggression.t,"lnorm")

# qqp requires estimates of the parameters of the negative binomial, Poisson# and gamma distributions. You can generate estimates using the fitdistr# function. Save the output and extract the estimates of each parameter as I# have shown below.nbinom<-fitdistr(recog$Aggression.t,"Negative Binomial")qqp(recog$Aggression.t,"nbinom",size= nbinom$estimate[[1]],mu= nbinom$estimate[[2]])

Check out the plots I've generated using qqp. The y axis represents the observations and the x axis represents the quantiles modeled by the distribution. The solid red line represents a perfect distribution fit and the dashed red lines are the confidence intervals of the perfect distribution fit. You want to pick the distribution for which the largest number of observations falls between the dashed lines. In this case, that's the lognormal distribution, in which only one observation falls outside the dashed lines. Now, armed with the knowledge of which probability distribution fits best, I can try fitting a model.

3. How to fit a mixed model to your data

3a. If your data are normally distributed

First, a note: if your data best fit the lognormal distribution, do not transform them. This is true for any type of transformation you might apply to your data to make them normal. If you can transform your data to normality, common wisdom says you should use the transformed data. More recent statistics literature has entirely changed stance on this matter, however, because transformation makes interpretation of model results more difficult, and it makes mischief with the variance of the transformed variable. Even if your data are transformable to normality, they are still not normal, and you should move on to the next section.

If your data are normally distributed, your life will be a little easier, because you can use a linear mixed model (LMM). You will want to load the lme4 package and make a call to the function lmer. The first argument to the function is a formula that takes the form y ~ x1 + x2 ... etc., where y is the response variable and x1, x2, etc. are explanatory variables. Random effects are added in with the explanatory variables. Crossed random effects take the form (1 | r1) + (1 | r2) ... while nested random effects take the form (1 | r1 / r2).

The next argument is where you designate the data frame your variables come from. The argument after that is an important one. This is where you can designate whether the mixed model will estimate the parameters using maximum likelihood or restricted maximum likelihood. If your random effects are nested, or you have only one random effect, and if your data are balanced (i.e., similar sample sizes in each factor group) set REML to FALSE, because you can use maximum likelihood. If your random effects are crossed, don't set the REML argument because it defaults to TRUE anyway.

Lest this all seem too abstract, let's try this with some data. We will use some data from my very first published study about song in superb starlings. In this study, we were interested in sexual dimorphism, the differences between male and female song, and whether birds of different social ranks, helper and breeder, sang differently. Our random effect was social group. Mean pitch of song fits a normal probability distribution.

Let's look at these results. First we get some measures of model fit, including AIC, BIC, log likelihood, and deviance. Then we get an estimate of the variance explained by the random effect. This number is important, because if it's indistinguishable from zero, then your random effect probably doesn't matter and you can go ahead and do a regular linear model instead. Next we have estimates of the fixed effects, with standard errors. This information may be enough for you. Some journals like you to report the results of these models as effect sizes with confidence intervals. Certainly when I look at the fixed effects estimate, I can already tell that mean pitch does not differ between sexes and social ranks. But some journals want you to report p-values.

The creators of the lme4 package are philosophically opposed to p-values, for reasons I'll not go into here, so if you want some p-values you'll have to turn to the Anova function in the car package.

The Anova function does a Wald test, which tells us how confident we are of our estimate of the effect of sex and social rank on pitch, and the p-value tells me that I should not be confident at all.

There is one complication you might face when fitting a linear mixed model. R may throw you a "failure to converge" error, which usually is phrased "iteration limit reached without convergence." That means your model has too many factors and not a big enough sample size, and cannot be fit. Unfortunately, I don't have any data that actually fail to converge on a model that I can show you, but let's pretend that last model didn't converge. What you should then do is drop fixed effects and random effects from the model and compare to see which fits the best. Drop fixed effects and random effects one at a time. Hold the fixed effects constant and drop random effects one at a time and find what works best. Then hold random effects constant and drop fixed effects one at a time. Here I have only one random effect, but I'll show you by example with fixed effects.

Note that this anova function is not the same as the Anova function we used to evaluate the significance of fixed effects in the model. This anova function with a lowercase 'a' is for comparing models. The p values indicate that there are no groundshakingly important differences between the models. We can also compare the AIC values and note that the model with the lowest AIC value is the one with no fixed effects at all, which fits with our understanding that sex and social rank have no effect on song pitch. Whatever you approach you take, be sure to report in your manuscript the p values or AIC values you used to pick the best model.

3b. If your data are not normally distributed

This is where life gets interesting. You see, the REML and maximum likelihood methods for estimating the effect sizes in the model make assumptions of normality that don't apply to your data, so you have to use a different method for parameter estimation. The problem is that there are many alternative estimation methods, each run from a different R package, and it can be hard to decide which one suits. I will guide you through this decision process with examples.

First, we need to test whether we can use penalized quasilikelihood (PQL) or not. PQL is a flexible technique that can deal with non-normal data, unbalanced design, and crossed random effects. However, it produces biased estimates if your response variable fits a discrete count distribution, like Poisson or binomial, and the mean is less than 5 - or if your response variable is binary.

The Aggression variable fits the log-normal distribution, which is not a discretized distribution. That means we can proceed with the PQL method. But before we proceed, let's return to the matter of transformation to normality. The reason we want to use a GLMM for this is that if we imagine a stastical method as E(x), E(ln(x)) is not the same as ln(E(x)). The former is performing a LMM on a transformed variable, while the latter is performing a GLMM on an untransformed variable. The latter is better because it better captures the variance of x.

Make sure you have the MASS package loaded. Note that instead of taking all the fixed and random effects as one formula, the random effects get their own argument in the glmmPQL function. To set the distribution to log-normal, we set the family to gaussian (another word for normal) and the link to log. The link can be anything, though if you want to use something besides log or inverse then you'll have to research how to customize the link function yourself.

This model suggests that season has an effect on Aggression, that is, wasps collected late in the colony cycle were less aggressive than those collected early. It also suggests that the relationship between the wasps has an effect; they are more likely to be aggressive toward strangers than nestmates. I would report these statistics in a paper with the estimate, standard error, t-value, and p-value.

So what if the mean of your response variable is less than 5, or you have a binary response variable, and you can't use PQL? Here you're going to have to make another decision, because there are two alternatives you can use: the Laplace approximation and Markov chain Monte Carlo algorithms (MCMC). The Laplace approximation can handle up to 3 random effects. Any more than that, and you'll have to use MCMC, which is a Bayesian method that can be somewhat confusing.

Let's start with an example where we can use the Laplace approximation. We will use a sample dataset from the package mlmRev, which gives us data about how well some Dutch students did in school. For the purposes of this example, I will subset the data to only a few variables of interest, and simplify the "repeatgr" variable to a binary response.

Let's say we want to find out whether membership in a racial minority and socioeconomic status affect students' likelihood to repeat a grade. Our response variable is 'repeatgr', a binary response indicating whether the student repeated a grade or not. Racial minority status is a binary Y/N category and socioeconomic status is represented by 'ses', a numeric scale that ranges from 10 to 50, where 50 is the richest. Our random factor is 'schoolNR', which represents the schools from which the students were sampled. Because the response variable is binary, we will need a generalized linear mixed model with a binomial distribution, and because we have fewer than five random effects, we can use the Laplace approximation.

Strictly speaking, the Laplace approximation is a special case of a parameter estimation method called Gauss-Hermite quadrature (GHQ), with one iteration. GHQ is more accurate than Laplace due to repeated iterations, but becomes less flexible after the first iteration, so you can only use it for one random effect. We can use it in this example because our only random effect is 'schoolNR.' To go ahead with this method, we use the lme4 package again.

GHQ<-glmer(repeatgr~Minority+ses+ses*Minority+(1|schoolNR),data= bdf,family=binomial(link="logit"),nAGQ=25)# Set nAGQ to # of desired iterations

You can see there's no important difference between Laplace and GHQ in this case. Both show that socioeconomic class has a highly significant effect on students' likelihood to repeat a grade, though even with the logit transformation we can see that the effect size is small. There is one more consideration, though, when using this method. It becomes inaccurate when used on overdispersed data - that is, when the combined residuals are much larger than the residual degrees of freedom. When you use this method, you should check the model to make sure the data are not overdispersed. Here is some code that will help you with that.

overdisp_fun<-function(model) {## number of variance parameters in an n-by-n variance-covariance matrixvpars<-function(m) {nrow(m)*(nrow(m)+1)/2}# The next two lines calculate the residual degrees of freedommodel.df<-sum(sapply(VarCorr(model), vpars))+length(fixef(model))rdf<-nrow(model.frame(model))-model.df# extracts the Pearson residualsrp<-residuals(model,type="pearson")Pearson.chisq<-sum(rp^2)prat<-Pearson.chisq/rdf# Generates a p-value. If less than 0.05, the data are overdispersed.pval<-pchisq(Pearson.chisq,df= rdf,lower.tail=FALSE)c(chisq= Pearson.chisq,ratio= prat,rdf= rdf,p= pval)}

Save this code as an R script and source it. Use the function on the model you've created. These school data are not overdispersed. If yours are, don't panic. You can model overdispersion as a random effect, with one random effect level for each observation. In this case, I would use student ID as the random effect variable. Note that you cannot use this method for a binomially distributed population if you have only one observation per individual.

Let's move on to the case where we can't use glmmPQL (i.e., because the mean of Poisson data is too small or because the response variable is categorical) and we have five or more random effects. I couldn't find a sample dataset for this, so instead we will turn the variable of interest in a sample dataset into a binary response. Consider these data about barley farmers. Imagine that for a barley harvest to yield a profit, the income from that barley harvest must be greater than 140.

In this case, let's say we have no explanatory variables at all. We don't have any ideas about what fixed effects might influence whether a barley harvest turns a profit or not. We're interested in which random effects contribute to the variability of profit. After all, random effects are factors that change the variance of a response variable; sometimes we're trying to account for that variance to make the fixed effects clearer, but sometimes we're interested in the variances of fixed effects for their own sake. To do this, we will use MCMCglmm, which can not only handle many random effects, but provides confidence intervals for the random effects, which none of the other packages we've used here provide in their summary (though in lme4 you can use confint() on a fitted model to achieve the same end.)

The confusing part about MCMCglmm is that it is a Bayesian statistical method. All models make assumptions about the distribution of the variance in your data, but in a Bayesian method these assumptions are explicit, and we need to specify these assumed distributions. In Bayesian statistics, we call these priors. The priors I've used here are bog-standard and will work for most data. Feel free to use them yourself. Just keep in mind that one R structure needs to be specified for each fixed effect and one G structure needs to be specified for each random effect. Also remember my caution about the lognormal distribution: these priors may not play nicely with data modeled with a log link, so do some research on what priors to use for data on a log scale.

If we look at the means and confidence intervals, we can see that the only random effects that really contributes to the variability of profit are year and genotype. That is to say, the probability of a barley harvest turning a profit varies a lot from year to year, and between genotypes, but it doesn't vary much between districts, farmers, or places.

Closing thoughts: know your data

I've taken you on a whirlwind tour of mixed models. I emphasize this because a lot more goes into data analysis than I've shown you. There are some important steps that go before and after that I don't have space to cover in detail here, but that I'd like to touch upon. Those steps are graphing.

You can't really know which analyses are right for your data until you get familiar with them, and the best way to get familiar with them is to plot them. Usually my first step is to do density plots of my variable of interest, broken down by the explanatory variable I'm most curious about. Density plots are like histograms, except they aren't dependent on how large you make the bins along the x axis. There are a few ways to do make them, but here's how I do it in ggplot2, which makes very pretty graphs.

Here I split up my data by season and relation, my two fixed effects. We can see right away that the dataset contains an extreme positive outlier; by far most of the observations fall between 0 and 20 and there's an outlier or two throwing it off. This is good to know. We can also see that a high proportion of the late season observations are equal to zero.

Plotting is also important for assessing model fit. You can tell which model you fit does the best job describing the data by plotting the fitted values in various ways. One easy application is graphing the residuals of a model. If you imagine a model as a best-fit line going through the scatterplot of your data, the residuals are the distances of of the points in the scatterplot from the best-fit line. If the model fits, then if you plot residuals against the fitted values, you should see random scatter. If the scatter is not random that means there's some other random or fixed effect that explains variation in your data that you're missing.

Let's try plotting the residuals of the mixed model I fit for song pitch in superb starlings. What this plot does is create a dashed horizontal line representing zero: an average of zero deviation from the best-fit line. It also creates a solid line that represents the residual deviation from the best-fit line. I want the solid line to overlay the dashed line.

And the results are just as I hoped: the deviation from the best-fit line evens out to zero. If that solid line didn't cover the dashed line, that would mean the best-fit line does not fit particularly well.

Let's try a technique for graphically comparing MCMC models. We'll go back to our barley farmers example and use the scapeMCMC package for a diagnostic plot of the model. We're looking for graphs that resemble white noise around a line.

library(scapeMCMC)

## KernSmooth 2.23 loaded
## Copyright M. P. Wand 1997-2009

plotTrace(MCMC$VCV,log=TRUE)

These random effects are looking pretty spiky, not like white noise. So let's try refitting the model with more iterations. This is more computationally intensive, but produces more accurate results.

Now everything looks much closer to white noise around a line, which suggests a better model. Now let's compare the confidence intervals for the variance of the random effects between the two models.

# glue together the estimates and confidence intervals for the two modelsconf.int<-rbind(summary(MCMC)$Gcovariances,summary(MCMC2)$Gcovariances)# create a data frame with factors for which model and which random effectconf.int<-data.frame(conf.int,model=factor(rep(c("model1","model2"),each=5)),random.effect=dimnames(conf.int)[[1]])

## Warning: some row.names duplicated: 6,7,8,9,10 --> row.names NOT used

This is a reassuring plot because the estimates are very similar between the two models (though the estimate for year is a little lower in the second) but the confidence interval for year is markedly smaller in the second model, which means we can be more confident about this estimate. We can feel pretty good about our inferences on the second model, i.e., that genotype and year are the main contributors to variability.

Our brains are best at detecting patterns when they are presented visually, so plot your data and your models whenever you can. Learn to use the base, lattice, or ggplot2 package and it will serve you well for years to come.