Introduction to Generalized Linear Mixed Models

Background

Generalized linear mixed models (or GLMMs) are an extension of linear
mixed models to allow response variables from different distributions,
such as binary responses. Alternatively, you could think of GLMMs as
an extension of generalized linear models (e.g., logistic regression)
to include both fixed and random effects (hence mixed models). The
general form of the model (in matrix notation) is:

\[
\mathbf{y} = \boldsymbol{X\beta} + \boldsymbol{Z\gamma} + \boldsymbol{\varepsilon}
\]

Where \(\mathbf{y}\) is an N x 1 column vector, the outcome variable;
\(\mathbf{X}\) is an N x p matrix of the p predictor variables;
\(\boldsymbol{\beta}\) is a p x 1 column vector of the fixed-effects regression
coefficients (the “betas”); \(\mathbf{Z}\) is the N x q design matrix for
the q random effects (the random complement to the fixed \(\mathbf{X}\));
\(\boldsymbol{\gamma}\) is a q x 1 column vector of the random
effects (the random complement to the fixed \(\boldsymbol{\beta}\));
and \(\boldsymbol{\varepsilon}\) is an N x 1
column vector of the errors, that part of \(\mathbf{y}\) that is not explained by
the model, \(\boldsymbol{X\beta} + \boldsymbol{Z\gamma}\). To recap the dimensions:

\[
\underbrace{\mathbf{y}}_{N \times 1} = \underbrace{\mathbf{X}}_{N \times p}\,\underbrace{\boldsymbol{\beta}}_{p \times 1} + \underbrace{\mathbf{Z}}_{N \times q}\,\underbrace{\boldsymbol{\gamma}}_{q \times 1} + \underbrace{\boldsymbol{\varepsilon}}_{N \times 1}
\]

To make this more concrete, let’s consider an example from a
simulated dataset. Doctors \((q = 407)\), indexed by the \(j\)
subscript, each see \(n_{j}\) patients. So our grouping variable is the
doctor. Not every doctor sees the same number of patients, ranging
from just 2 patients all the way to 40 patients, averaging about
21. The total number of patients is the sum of the patients seen by
each doctor:

\[
N = \sum_{j}^q n_j
\]

In our example, \(N = 8525\) patients were seen by doctors.
Our outcome, \(\mathbf{y}\), is a continuous variable,
mobility scores. Further, suppose we had six fixed effects: a fixed
intercept plus Age (in years), Married (0 = no, 1 = yes),
Sex (0 = female, 1 = male), Red Blood Cell (RBC) count, and
White Blood Cell (WBC) count, along with a
random intercept for every doctor. For simplicity, we are only going
to consider random intercepts. We will let every other effect be
fixed for now. The reason we want any random effects is because we
expect that mobility scores within doctors may be
correlated. There are many reasons why this could be. For example,
doctors may have specialties that mean they tend to see lung cancer
patients with particular symptoms or some doctors may see more
advanced cases, such that within a doctor,
patients are more homogeneous than they are between doctors.
To put this example back in our matrix notation (with \(N = 8525\),
\(p = 6\), and \(q = 407\)), we would have:

\[
\underbrace{\mathbf{y}}_{8525 \times 1} = \underbrace{\mathbf{X}}_{8525 \times 6}\,\underbrace{\boldsymbol{\beta}}_{6 \times 1} + \underbrace{\mathbf{Z}}_{8525 \times 407}\,\underbrace{\boldsymbol{\gamma}}_{407 \times 1} + \underbrace{\boldsymbol{\varepsilon}}_{8525 \times 1}
\]

Because \(\mathbf{Z}\) is so big, we will not write out the numbers
here. Because we are only modeling random intercepts, it is a
special matrix in our case that only codes which doctor a patient
belongs to. So in this case, it is all 0s and 1s. Each column is one
doctor and each row represents one patient (one row in the
dataset). If the patient belongs to the doctor in that column, the
cell will have a 1, 0 otherwise. This also means that it is a sparse
matrix (i.e., a matrix of mostly zeros) and we can create a picture
representation easily. Note that if we added a random slope, the
number of rows in \(\mathbf{Z}\) would remain the same, but the
number of columns would double. This is why it can become
computationally burdensome to add random effects, particularly when
you have a lot of groups (we have 407 doctors). In all cases, the
matrix will contain mostly zeros, so it is always sparse. In the
graphical representation, the line appears to wiggle because the
number of patients per doctor varies.
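
As a small illustration of this structure (not from the example data; the doctor IDs and ages below are made up), here is one way to build such a sparse random-intercept \(\mathbf{Z}\), and to see how adding a random slope doubles the columns:

```python
import numpy as np
from scipy import sparse

# Hypothetical doctor IDs (0 .. q-1), one per patient row; made-up values.
doctor = np.array([0, 0, 1, 1, 1, 2])
n, q = len(doctor), doctor.max() + 1

# Random-intercept design matrix: one column per doctor, a 1 in the column of
# the doctor each patient belongs to, 0 everywhere else, so it is mostly zeros.
Z_int = sparse.csr_matrix((np.ones(n), (np.arange(n), doctor)), shape=(n, q))

# Adding a random slope (say, for Age) doubles the number of columns: the same
# row/column pattern, but the nonzero entries now hold each patient's Age.
age = np.array([45.0, 52.0, 60.0, 38.0, 71.0, 55.0])   # made-up ages
Z_slope = sparse.csr_matrix((age, (np.arange(n), doctor)), shape=(n, q))
Z = sparse.hstack([Z_int, Z_slope])                     # n x 2q, still sparse

print(Z.toarray())
```

With 8,525 patients and 407 doctors the real \(\mathbf{Z}\) would be 8525 x 407 (or 8525 x 814 with a random slope), but the construction is the same.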

In order to see the structure in more detail, we could also zoom in
on just the first 10 doctors. The filled space indicates rows of
observations belonging to the doctor in that column, whereas the
white space indicates not belonging to the doctor in that column.

If we estimated it, \(\boldsymbol{\gamma}\) would be a column
vector, similar to \(\boldsymbol{\beta}\). However, in classical
statistics, we do not actually estimate \(\boldsymbol{\gamma}\).
Instead, we nearly always assume that:

\[
\boldsymbol{\gamma} \sim \mathcal{N}(\mathbf{0}, \mathbf{G})
\]

This is read: “gamma is distributed as normal with mean zero and
variance G”, where \(\mathbf{G}\) is the variance-covariance matrix
of the random effects. Because we directly estimate the fixed
effects, including the fixed-effect intercept, the random-effect
complements are modeled as deviations from the fixed effect, so they
have mean zero. The random effects are just deviations around the
value in \(\boldsymbol{\beta}\), which is the mean. So what is left
to estimate is the variance. Because our example only had a random
intercept, \(\mathbf{G}\) is just a 1 x 1 matrix, the variance of
the random intercept. However, it can be larger. For example,
suppose that we had a random intercept and a random slope; then

\[
\mathbf{G} =
\begin{bmatrix}
\sigma^2_{int} & \sigma_{int,slope} \\
\sigma_{int,slope} & \sigma^2_{slope}
\end{bmatrix}
\]

Because \(\mathbf{G}\) is a variance-covariance matrix, we know that
it should have certain properties. In particular, we know that it is
square, symmetric, and positive semidefinite. We also know that it has
redundant elements. For a q x q matrix, there are only
\(\frac{q(q+1)}{2}\) unique elements. To simplify computation by
removing redundant elements and to ensure that the resulting estimated
matrix is positive definite, rather than model \(\mathbf{G}\)
directly, we estimate \(\boldsymbol{\theta}\) (e.g., a triangular
Cholesky factorization \(\mathbf{G} = \mathbf{LDL^{T}}\)).
\(\boldsymbol{\theta}\) is not always parameterized the same way,
but you can generally think of it as representing the random
effects. It is usually designed to contain non-redundant elements
(unlike the variance-covariance matrix) and to be parameterized in a
way that yields more stable estimates than variances (such as taking
the natural logarithm to ensure that the variances are
positive). Regardless of the specifics, we can say that

\[
\mathbf{G} = \Sigma(\boldsymbol{\theta})
\]

In other words, \(\mathbf{G}\) is some function of
\(\boldsymbol{\theta}\). So we get some estimate of
\(\boldsymbol{\theta}\) which we call \(\hat{\boldsymbol{\theta}}\).
Various parameterizations and constraints allow us to simplify the
model, for example by assuming that the random effects are
independent, which would imply the true structure is

\[
\mathbf{G} =
\begin{bmatrix}
\sigma^2_{int} & 0 \\
0 & \sigma^2_{slope}
\end{bmatrix}
\]
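
As a rough sketch of this idea (this is just one of many possible parameterizations, and the numbers are made up), \(\mathbf{G}\) can be built from an unconstrained \(\boldsymbol{\theta}\) through a Cholesky-style factor whose diagonal is kept positive by exponentiating:

```python
import numpy as np

def G_from_theta(theta, q=2):
    """Build a q x q random-effects covariance matrix G from an unconstrained
    parameter vector theta (a log-Cholesky style parameterization).

    theta has q*(q+1)/2 elements: the first q are log standard deviations
    (exponentiated so the diagonal of the factor stays positive); the rest
    fill the strict lower triangle of the factor.
    """
    L = np.zeros((q, q))
    L[np.diag_indices(q)] = np.exp(theta[:q])    # positive diagonal
    L[np.tril_indices(q, k=-1)] = theta[q:]      # unconstrained off-diagonals
    return L @ L.T                               # symmetric and positive definite

# A random intercept and a random slope (q = 2) need q(q+1)/2 = 3 parameters.
theta_hat = np.array([0.5, -0.2, 0.3])           # made-up estimate
print(G_from_theta(theta_hat))
```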

The final element in our model is the variance-covariance matrix of the
residuals, \(\boldsymbol{\varepsilon}\), or the conditional covariance matrix of
\(\mathbf{y} | \boldsymbol{X\beta} + \boldsymbol{Z\gamma}\).
The most common residual covariance structure is

\[
\mathbf{R} = \mathbf{I}\sigma^2_{\varepsilon}
\]

where \(\mathbf{I}\) is the identity matrix (a diagonal matrix of 1s)
and \(\sigma^2_{\varepsilon}\) is the residual variance. This
structure assumes a homogeneous residual variance for all
(conditional) observations and that they are (conditionally)
independent. Other structures can be assumed such as compound
symmetry or autoregressive. The \(\mathbf{G}\) terminology is common
in SAS, and also leads to talking about G-side structures for the
variance-covariance matrix of the random effects and R-side structures
for the residual variance-covariance matrix.
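
To make one of those alternatives concrete (an illustrative matrix, not something estimated from the example data), a compound symmetry R-side structure for the \(n_j\) patients of a single doctor would look like

\[
\mathbf{R}_j =
\begin{bmatrix}
\sigma^2_{\varepsilon} & \rho\sigma^2_{\varepsilon} & \cdots & \rho\sigma^2_{\varepsilon} \\
\rho\sigma^2_{\varepsilon} & \sigma^2_{\varepsilon} & \cdots & \rho\sigma^2_{\varepsilon} \\
\vdots & \vdots & \ddots & \vdots \\
\rho\sigma^2_{\varepsilon} & \rho\sigma^2_{\varepsilon} & \cdots & \sigma^2_{\varepsilon}
\end{bmatrix}
\]

so that any two residuals within the same doctor share a common correlation \(\rho\) rather than being conditionally independent.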

So the final fixed elements are \(\mathbf{y}\), \(\mathbf{X}\),
\(\mathbf{Z}\), and \(\boldsymbol{\varepsilon}\). The final estimated
elements are \(\hat{\boldsymbol{\beta}}\),
\(\hat{\boldsymbol{\theta}}\), \(\hat{\mathbf{G}}\), and
\(\hat{\mathbf{R}}\). The final model depends on the distribution
assumed, but is generally of the form

\[
g(E(\mathbf{y} \mid \boldsymbol{\gamma})) = \boldsymbol{\eta} = \boldsymbol{X\beta} + \boldsymbol{Z\gamma}
\]

where \(g(\cdot)\) is the link function that connects the conditional mean of
the outcome to the linear predictor \(\boldsymbol{\eta}\).

Count Outcomes

We could fit a similar model for a count outcome, number of
tumors. Counts are often modeled as coming from a Poisson
distribution, with the canonical link being the log. We will do that
here and use the same predictors as in the mixed effects logistic model,
predicting count from Age, Married (yes = 1, no = 0), and
IL6 (continuous). We allow the intercept to vary randomly by each
doctor. We might make a summary table like this for the results.

Mixed Effects Poisson for Tumor Count

Parameter                   Est.     SE      p-value    Exp(Est.)
Intercept                   -.233    .057    <.001      .792
Age                          .026    .001    <.001      1.026
Married (yes v no)          -.13     .013    <.001      .878
IL6                          .005    .002    .025       1.005
\(\sigma^2_{intercept}\)     .169

\(N_{patients}\) = 8,525; \(N_{doctors}\) = 407

The interpretations again follow those for a regular Poisson model:
for a one unit increase in Age, the expected log count of tumors
increases by .026. People who are married are expected to have .13 lower log
counts of tumors than people who are single. Finally, for a one unit
increase in IL6, the expected log count of tumors increases by .005.

It can be more useful to talk about expected counts rather than
expected log counts. However, we get the same interpretational
complication as with the logistic model. The expected counts are
conditional on every other value being held constant, again including
the random doctor effects. So for example, we could say that people
who are married are expected to have .878 times as many tumors as
people who are not married, for people with the same doctor (or same
random doctor effect) and holding age and IL6 constant.
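
As a quick check of the Exp(Est.) column (the coefficients below are taken directly from the table above), exponentiating the log-scale estimates gives these multiplicative effects:

```python
import math

# Fixed-effect estimates from the summary table (log-count scale).
b_age, b_married, b_il6 = 0.026, -0.13, 0.005

# exp(coefficient) is the multiplicative change in the expected tumor count
# for a one-unit increase, holding everything else (including the doctor's
# random intercept) constant.
print(math.exp(b_age))      # ~1.026: about 2.6% more tumors per year of age
print(math.exp(b_married))  # ~0.878: married patients expected to have ~12% fewer
print(math.exp(b_il6))      # ~1.005: about 0.5% more per one-unit increase in IL6
```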

Like we did with the mixed effects logistic model, we can plot
histograms of the expected counts from our model for our entire
sample, holding the random effects at specific values, here at the
20th, 40th, 60th, and 80th percentiles. This gives us a sense of how
much variability in tumor count can be expected by doctor (the
position of the distribution) versus by fixed effects (the spread of
the distribution within each graph).

This time, there is less variability so the results are less
dramatic than they were in the logistic example.
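
A rough sketch of how such predictions could be computed is below. The predictor values are simulated stand-ins (the actual dataset is not shown here); only the coefficients and the random-intercept variance come from the table above.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Fixed-effect estimates from the table: intercept, Age, Married, IL6.
beta = np.array([-0.233, 0.026, -0.13, 0.005])

# Stand-in predictor values; in practice this would be the real model matrix.
n = 8525
X = np.column_stack([np.ones(n),
                     rng.normal(50, 10, n),       # made-up Age values
                     rng.integers(0, 2, n),       # made-up Married indicator
                     rng.normal(4, 1, n)])        # made-up IL6 values

# Hold the random intercept at percentiles of its estimated N(0, .169)
# distribution and look at the expected counts for the whole sample each time.
for p in (0.20, 0.40, 0.60, 0.80):
    gamma = norm.ppf(p, loc=0.0, scale=np.sqrt(0.169))
    mu = np.exp(X @ beta + gamma)                 # expected counts at this percentile
    print(p, round(mu.mean(), 3))                 # one could plot a histogram of mu instead
```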

Finally, let’s incorporate both fixed and random effects for
each individual and look at the distribution of expected
tumor counts in our sample. That is, now both the fixed
and random effects can vary for every person.

Other Issues

For power and reliability of estimates, often the limiting factor
is the sample size at
the highest unit of analysis. For example, having 500 patients
from each of ten doctors would give you a reasonable total number of
observations, but not enough to get stable estimates of doctor effects
nor of the doctor-to-doctor variation. 10 patients from each of 500
doctors (leading to the same total number of observations)
would be preferable.

For parameter estimation, because there are no closed-form solutions
for GLMMs, you must use some approximation. Three are fairly common.

Quasi-likelihood approaches use a Taylor series expansion
to approximate the likelihood. Thus parameters are estimated
to maximize the quasi-likelihood. That is, they are not true
maximum likelihood estimates. A Taylor series uses a finite number of
derivatives of a function to approximate the function, and the
resulting polynomial can be integrated term by term using the power rule. With
each additional term used, the approximation error decreases
(at the limit, the Taylor series will equal the function),
but the complexity of the Taylor polynomial also increases. Early
quasi-likelihood methods tended to use a first order expansion;
more recently, a second order expansion has become more common. In general,
quasi-likelihood approaches are the fastest (although they can still
be quite complex), which makes them useful for exploratory purposes
and for large datasets. Because of the bias associated with them,
quasi-likelihoods are not preferred for final models or statistical
inference.
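
For reference, a Taylor expansion of a function \(f\) around a point \(a\) looks like

\[
f(x) \approx f(a) + f'(a)(x - a) + \frac{f''(a)}{2}(x - a)^2 + \cdots
\]

A first order method keeps only the linear term; a second order method also keeps the quadratic term.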

The true likelihood can also be approximated using numerical
integration. Quadrature methods are common, and perhaps most
common among these use the Gaussian quadrature rule,
frequently with the Gauss-Hermite weighting function. It is also common
to incorporate adaptive algorithms that adaptively vary the
step size near points with high error. The accuracy increases as
the number of integration points increases. Using a single integration
point is equivalent to the so-called Laplace approximation.
Each additional integration point will increase the number of
computations and thus the time to convergence, although it
increases the accuracy. Adaptive Gauss-Hermite quadrature might
sound very appealing and is in many ways.
However, the number of function evaluations required grows
exponentially as the number of dimensions increases. A
random intercept is one dimension, adding a random slope would
be two. For three level models with random intercepts and slopes,
it is easy to create problems that are intractable with Gaussian
quadrature. Consequently, it is a useful method when a high degree
of accuracy is desired but performs poorly in high dimensional spaces,
for large datasets, or if speed is a concern.
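
As a minimal sketch of the idea (the counts, linear predictors, and variance below are made up, and this version is non-adaptive for brevity), Gauss-Hermite quadrature can approximate the marginal likelihood contribution of a single doctor in a random-intercept Poisson model:

```python
import numpy as np
from scipy.stats import poisson

def group_loglik(y, eta_fixed, sigma, n_points=15):
    """Gauss-Hermite approximation to one doctor's marginal log-likelihood in a
    random-intercept Poisson GLMM: integrate the conditional likelihood over
    the N(0, sigma^2) random intercept. More nodes means more function
    evaluations but a more accurate integral."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_points)
    total = 0.0
    for t, w in zip(nodes, weights):
        gamma = np.sqrt(2.0) * sigma * t            # change of variables for N(0, sigma^2)
        mu = np.exp(eta_fixed + gamma)              # conditional means given this gamma
        total += w * np.prod(poisson.pmf(y, mu))    # conditional likelihood at this node
    return np.log(total / np.sqrt(np.pi))

# Made-up counts and fixed-effect linear predictors for one doctor's patients.
y = np.array([0, 2, 1, 3])
eta_fixed = np.array([0.1, 0.4, 0.2, 0.6])
print(group_loglik(y, eta_fixed, sigma=0.5))
```

With a single random intercept this is a one-dimensional integral; with an intercept and a slope the sum would run over a two-dimensional grid of nodes, which is where the exponential growth in function evaluations comes from.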

A final set of methods particularly useful for multidimensional
integrals are Monte Carlo methods including the famous
Metropolis-Hastings algorithm and Gibbs sampling, which are types of
Markov chain Monte Carlo (MCMC) algorithms. Although Monte Carlo
integration can be used in classical statistics, it is more common to
see this approach used in Bayesian statistics.
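
Continuing the same made-up one-doctor example, plain Monte Carlo integration approximates the same integral by averaging over draws of the random intercept (a sketch only; MCMC methods such as Metropolis-Hastings are considerably more involved):

```python
import numpy as np
from scipy.stats import poisson

rng = np.random.default_rng(1)

def group_loglik_mc(y, eta_fixed, sigma, n_draws=10000):
    """Monte Carlo version of the one-doctor integral: draw the random
    intercept from N(0, sigma^2) and average the conditional likelihoods."""
    gammas = rng.normal(0.0, sigma, size=n_draws)
    liks = [np.prod(poisson.pmf(y, np.exp(eta_fixed + g))) for g in gammas]
    return np.log(np.mean(liks))

y = np.array([0, 2, 1, 3])
eta_fixed = np.array([0.1, 0.4, 0.2, 0.6])
print(group_loglik_mc(y, eta_fixed, sigma=0.5))   # close to the quadrature answer
```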

Another issue that can occur during estimation is quasi or complete
separation. Complete separation means
that a predictor variable separates the outcome variable completely,
leading to perfect prediction of the outcome by that predictor. Particularly if
the outcome is skewed, there can also be problems with the random effects.
For example, if one doctor only had a few patients and all of them
either were in remission or were not, there would be no variability
within that doctor.
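
A toy illustration of complete separation with hypothetical numbers: every patient with the risk factor has the event and every patient without it does not, so the predictor perfectly predicts the outcome and the maximum likelihood estimate of its coefficient does not converge to a finite value.

```python
import pandas as pd

# Hypothetical data in which the predictor perfectly separates the outcome.
toy = pd.DataFrame({"risk_factor":  [0, 0, 0, 1, 1, 1],
                    "in_remission": [1, 1, 1, 0, 0, 0]})
print(pd.crosstab(toy["risk_factor"], toy["in_remission"]))  # two cells are always zero
```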