Abstract

Incomplete data are quite common in biomedical and other types of research, especially in longitudinal studies. During the last three decades, a vast amount of work has been done in the area. This has led, on the one hand, to a rich taxonomy of missing-data concepts, issues, and methods and, on the other hand, to a variety of data-analytic tools. Elements of taxonomy include: missing data patterns, mechanisms, and modeling frameworks; inferential paradigms; and sensitivity analysis frameworks. These are described in detail. A variety of concrete modeling devices is presented. To make matters concrete, two case studies are considered. The first one concerns quality of life among breast cancer patients, while the second one examines data from the Muscatine children’s obesity study.

1 Introduction

In a longitudinal study, each experimental or observational unit is measured at baseline and repeatedly over time. Incomplete data are not unusual under such designs, as many subjects are not available to be measured at all time points. In addition, a subject can be missing at one follow-up time and then measured again at one of the next, resulting in nonmonotone missing data patterns. Such data present a considerable modeling challenge for the statistician.

Rubin (1976) distinguished between three important mechanisms. When missingness is unrelated to the data, missingness is termed missing completely at random (MCAR). When missingness depends on the observed data and, when given the observed data, it does not depend on the unobserved data, the mechanism is missing at random (MAR). A mechanism where missigness depends on the unobserved data, perhaps in addition to the observed data, is termed missing not at random (MNAR). In the likelihood and Bayesian paradigm, and when mild regularity conditions are satisfied, the MCAR and MAR mechanisms are ignorable, in the sense that inferences can proceed by analyzing the observed data only, without explicitly addressing a (parametric) form of the missing data mechanism. In this situation, MNAR mechanisms are nonignorable. Note that frequentist inference is generally ignorable only under MCAR.

In the ignorable situation, standard longitudinal data software allowing for unbalanced data can be used. Examples include the SAS procedures MIXED, GLIMMIX, and NLMIXED, and the SPlus and R functions lme and nlme, etc… Such tools eliminate complete-case bias by incorporating all available information. However, in the nonignorable case, methods that do not model the missing data mechanism are subject to bias.

Whereas ignorable likelihood analyses and appropriate frequentist techniques, such as weighted generalized estimating equations (Robins et al. 1995), provide a versatile framework, as opposed to the collection of simple methods, such as complete case analysis or last observation carried forward, nonignorable missing data occur very commonly in longitudinal studies. In many cancer and AIDS clinical trials, the side effects of the treatment may affect participation, and missingness can depend on the outcome and the treatment covariate. In quality of life studies, compliance is not compulsory, and those with a poor prognosis may be more likely not to complete the questionnaire at every visit. In environmental studies, geographic location or environmental factors may influence the response. Examples of nonignorable missingness can also be found in longitudinal psychiatric studies (Molenberghs et al. 1997; Little and Wang 1996).

Estimating parameters with nonignorable missing data is complex. Likelihood-based methods require specification of the joint distribution of the data and the missing data mechanism. This specification can be further classified into three types of models: selection, pattern-mixture, and shared-parameter models (Little 1995). The selection approach models the hypothetical complete data together with the missing data process conditional on the hypothetical complete data. The pattern-mixture approach models the distribution of the data conditional on the missing data pattern. Both of these approaches will be discussed in this paper. The third approach, shared-parameter models, accounts for the dependence between the measurement and missingness processes by means of latent variables such as random effects (Wu and Bailey 1988, 1989; Wu and Carroll 1988; Creemers et al. 2009).

While other methods of estimation with nonignorable nonresponse will be considered briefly, likelihood-based frequentist methods using selection and pattern-mixture models will be the primary focus of this paper. The literature is just too enormous to review all possible inference paradigms in this paper, such as multiple imputation, Bayesian methods, and weighted estimating equations, for example. For the class of generalized linear models, Ibrahim et al. (2005) present a detailed overview and comparisons of the four main paradigms for handling missing covariate data, these being (i) maximum likelihood (ML), (ii) multiple imputation (MI), (iii) Bayesian methods, and (iv) weighted estimating equations (WEE).

The remainder of this section motivates the setting with two real longitudinal data sets with likely nonignorable missing data. Section 2 discusses types of missing data in longitudinal studies. Section 3 focuses on estimation in the normal random effects model. Section 4 discusses methods for estimation in the GLMM. Section 5 reviews methods for handling nonignorable missing covariate and/or response data in the GLMM. Shared-parameter models are the topic of Sect. 6. We give a brief discussion of Bayesian methods in Sect. 7 and give some concluding remarks in Sect. 8.

1.1 Motivating examples

As previously mentioned, many longitudinal studies call for estimation methods that can handle nonignorable missing data, since the possibility of such mechanism operation is impossible to rule out. This section presents two common examples to illustrate the problem in more detail.

Example 1: IBCSG data

Consider a data set concerning the quality of life among breast cancer patients in a clinical trial comparing four different chemotherapy regimens conducted by the International Breast Cancer Study Group (IBCSG Trial VI; Ibrahim et al. 2001). The main outcomes of the trial were time until relapse and death, but patients were also asked to complete quality of life questionnaires at baseline and at three-month intervals. Some patients did refuse, on occasion, to complete the questionnaire. However, even when they refused, the patients were asked to complete an assessment at their next follow-up visit. Thus, the structure of this trial resulted in nonmonotone patterns of missing data. One longitudinal quality of life outcome was the patient’s self-assessment of her mood, measured on a continuous scale from 0 (best) to 100 (worst). The three covariates of interest included a dichotomous covariate for language (Italian or Swedish), a continuous covariate for age, and three dichotomous covariates for the treatment regimen (4 regimens). Data from the first 18 months of the study were used, implying that each questionnaire was filled out at most seven times, i.e., at baseline plus at six follow-up visits.

There are 397 observations in the data set, and mood is missing at least one time for 71% of the cases, resulting in 116 (29%) complete cases. The amount of missing data is minimal at baseline (2%) and ranges between 24% and 31% at the other six times: 26.2% at the second, 24.2% at the third, 29% at the fourth, 24.9% at the fifth, 28.2% at the sixth, and 30.5% at the seventh occasion. Table 1 provides a summary of the missing data patterns; the overall fraction of missing measurements is 23.6%. All patients were alive at the end of 18 months, so no missingness is due to death. However, it is reasonable to conjecture that the mood of the patient affected their decision to fill out the questionnaire. In this case, the missingness would be MNAR, and an analysis that does not include the missing data mechanism would be biased. In fact, Ibrahim et al. (2001) show a slight difference in the significance of one of the treatment covariates and the age covariate between their ignorable and nonignorable models.

Example 2: Muscatine children’s obesity data

The Muscatine Coronary Risk Factor Study (MCRFS) was a longitudinal study of coronary risk factors in school children (Woolson and Clarke 1984; Ekholm and Skinner 1998). Five cohorts of children were measured for height and weight in 1977, 1979, and 1981. Relative weight was calculated as the ratio of a child’s observed weight to the median weight for their age-sex-height group. Children with a relative weight greater than 110% of the median weight for their respective stratum were classified as obese. The analysis of this study involves binary data (1 = obese, 0 = not obese) collected at successive time points. For every cohort, each of the following seven response patterns occurs: (p, p, p), (p, p, m), (p, m, p), (m, p, p), (p, m, m), (m, p, m), and (m, m, p), where a p (m) denotes that the child was present (missing) for the corresponding measurement. The distribution over the patterns is shown in Table 2.

The statistical problem is to estimate the obesity rate as a function of age and sex. However, as can be seen in Table 2, many data records are incomplete since many children participated in only one or two occasions of the survey. Ekholm and Skinner (1998) report that the two main reasons for nonresponse were: (i) no consent form signed by the parents was received, and (ii) the child was not in school on the day of the examination. If the parent did not sign the consent form because they did not want their child to be labeled as obese, or if the child did not attend school the day of the survey because of their weight, then the missingness would at least be MAR, and likely even MNAR. In the latter case, an analysis that ignores the missing data mechanism would be biased. However, since the outcome is binary, these data cannot be modeled using the normal random effects model. Instead, a general method for estimating parameters for the class of GLMM’s with nonignorable missing response data is needed.

2 Missing data in longitudinal studies

We will now formalize the ideas loosely described in the introduction. Methods for handling missing data often depend on the pattern of missingness and the mechanism that generates the missing values. To illustrate the various missingness patterns and mechanisms in a regression setting, consider a data set that consists of a univariate vector of responses yi = (yi1, …, yini)′ that may contain missing values, and an ni × p matrix Xi = (xi1, …, xini)′ of completely observed explanatory variables. We first define missing data patterns and then mechanisms.

2.1 Patterns of missing data

Data follow a monotone missing pattern if, once a subject misses a measurement occasion, s/he is never observed again. Monotone missing data are also termed dropout. For example, missing values in the vector of responses, yi take the dropout form if, whenever yij is missing, so are yik for all k ≥ j. Likelihood functions are easier to evaluate with monotone patterns of missing data since they can be factored in terms of conditional densities.

Data follow a nonmonotone missing pattern if at least some subject values are observed again after a missing value occurs. For example, if yi contains missing values, they are intermittent, and yij may be missing while yik is observed for some k > j. Likelihoods are more difficult to evaluate with nonmonotone patterns of missing data since almost always no simple factorization applies. In the MAR case, however, where ignorability applies, conventional software tools for longitudinal data models, allowing for unbalanced data, can be used to satisfaction.

2.2 Classifications of missing data mechanisms

Missing data are missing completely at random if the failure to observe a value does not depend on any values of yi, either observed or missing, or any other observed values. For example, suppose that some components of yi are missing while Xi is completely observed. The missing values of yi are MCAR if the probability of observing yi is independent of Xi and the values of yi that are observed or would have been observed. Under MCAR, the observed data is just a random sample of all the data. A complete-case analysis may lose efficiency, but no bias is introduced. Under MCAR, the missing data mechanism takes the simple form f (ri | Xi, ϕ) (where ϕ is a vector of unknown parameters), i.e., the outcomes do not intervene in the model for the missing-data indicators Ri = (Ri1, …, Rini)′, where Rij = 1 if Yij is observed and 0 otherwise.

Missing data are missing at random if the failure to observe a value does not depend on the values of yi which are unobserved, given the observed ones. However, the missingness may depend on other observed values. For example, suppose, as before, that Xi is completely observed while some components of yi may be missing. The missing values of yi are MAR if the probability of observing yi is independent of the values of yi that would have been observed but is not necessarily independent of the observed values of yi and Xi. This is a more realistic assumption than MCAR, but now adjustments must be made because observed responses are no longer a random sample. A complete-case analysis will be both inefficient and biased. Clearly, if data is MCAR, then it is MAR. For example, in a clinical trial, if missingness depends on the treatment allocation only, which has the status of a covariate, then the mechanism is MCAR and, a fortiori, also MAR. Under MAR, the missing data mechanism becomes f (ri | Xi, yobs,i, ϕ), where yobs,i denotes the observed components of yi.

The missing data mechanism is said to be missing not at random if the failure to observe a value depends on the value that would have been observed. For example, suppose that some components of yi are missing but that Xi is completely observed. The missing values of yi are MNAR if the probability that yi is missing depends on the missing values of yi, regardless of whether it depends on the observed values of yi or Xi. MNAR is the most general situation and is frequently encountered in longitudinal studies with repeated measures. Valid inferences generally require either specifying the correct model for the missing data mechanism, or distributional assumptions for yi, or both. The resulting estimators and tests are typically sensitive to these assumptions. Therefore, the mechanism should play a central role within so-called sensitivity analyses (Sect. 5.1). Under MNAR, the missing data mechanism is fully general, f (ri | Xi, yobs,i, ymis,i, ϕ).

Within the likelihood or Bayesian inferential frameworks, and when the parameters governing the measurement and missingness process are functionally independent, then MCAR and MAR mechanisms are ignorable. However, the frequentist framework generally requires the mechanism to be MCAR for ignorability to apply (Rubin 1976).

3 The normal random-effects model

The normal random-effects model, also known as the Laird–Ware model (Laird and Ware 1982), is a special case of the generalized linear mixed model, which is the subject of the next section. The model is intended for continuous, normally distributed outcomes. Precisely, for a given individual i with ni repeated measurements, the Laird–Ware model for outcome vector yi can be written as

yi=Xiβ+Zibi+ei,i=1,…,N,

(1)

where yi is ni × 1, Xi is an ni × p known matrix of fixed-effects covariates, β is a p × 1 vector of unknown regression parameters, commonly referred to as fixed effects, Zi is a known ni × q matrix of covariates for the q × 1 vector of random effects bi, and ei is an ni × 1 vector of errors. The columns of Zi are usually a subset of Xi, allowing for fixed effects as well as random intercepts and/or slopes. It is typically assumed that the ei’s are independent, the bi’s are i.i.d., the bi’s are independent of the ei’s, and

ei~Nni(0,σ2Ini),bi~Nq(0,D),

where Ini is the ni × ni identity matrix, and Nq(µ, Σ) denotes the q-dimensional multivariate normal distribution with mean µ and covariance matrix Σ. The positive definite matrix D is the covariance matrix of the random effects and is typically assumed to be unstructured and unknown. Under these assumptions, the so-called conditional model, where conditioning refers to the random effects, takes the form

(yi|β,σ2,bi)~Nni(Xiβ+Zibi,σ2Ini).

(2)

The model in (2) assumes a distinct set of regression coefficients for each individual once the random effects are known. Upon integration over the random effects, the so-called marginal distribution of yi is

(yi|β,σ2,D)~Nni(Xiβ,ZiDZi′+σ2Ini).

(3)

3.1 Complete-data estimation

We first describe maximum likelihood estimation in the normal mixed model with no missing response or covariate data. Maximum likelihood (ML) estimation has been extensively considered for the normal random effects model (see, for example, Laird and Ware 1982; Jennrich and Schluchter 1986). The standard approach is to take the first and second derivatives of the log-likelihood based on the marginal distribution of yi and use Newton–Raphson (based on the observed information) or Fisher scoring (based on the expected information) methods as the basis for iteratively obtaining the maximum likelihood estimates. Often, a hybrid approach to this iterative method is taken, where the updated value of is used to calculate = (2, ).

The method described here uses the expectation-maximization (EM) algorithm (Dempster et al. 1977) for computing ML estimates. The EM algorithm is advantageous over the Newton–Raphson or Fisher scoring algorithms when formulating models with large numbers of covariance parameters. The procedure consists of two steps. The first step uses weighted least squares ideas to update , which is equivalent to maximizing the likelihood with respect to β while holding the covariance parameters θ = (σ2, D) fixed. In the second step, is updated using Y = (y1, …, yN) as the observed data and V = (y1, b1, …, yN, bN) as the complete data.

Starting out with the first step, the log-likelihood based on the observed data, Y, is

where ei = yi − Xiβ − Zibi. One iterates between both steps until convergence.

Note that the EM algorithm converges linearly, in contrast to super-linear convergence of Fisher scoring and even quadratic convergence of Newton–Raphson. However, key advantages of the EM algorithm are that (1) implementation is frequently more straightforward and intuitive and (2) there is a much lower risk for divergence. Sometimes, hybrid algorithms can be used, setting out with EM and then switching to Fisher scoring or Newton–Raphson. Alternatively, EM-acceleration methods can be used (Louis 1982; Meilijson 1989). Such methods are also useful when determining measures of precision.

3.2 Estimation with nonignorable missing response data

When the missing data mechanism is MNAR, one needs to specify a (parametric) model for missingness alongside the aforementioned model for the outcomes and incorporate it into the complete data log-likelihood. The missing data mechanism is defined as the distribution of the ni × 1 random vector Ri, whose jth component rij = 1 if yij is missing and 0 otherwise. The distribution of Ri is indexed by the parameter vector ϕ and takes a multinomial form with 2ni cell probabilities. We assume here that the covariates are fully observed. Under the normal mixed model, the complete data density of (yi, bi, ri) for subject i is then given by f (yi, bi, ri |β, σ2, D, ϕ). Little (1993, 1995) identified two ways of factoring this joint distribution. Selection models decompose the joint distribution as (with covariates suppressed from notation)

f(yi,bi,ri|β,σ2,D,ϕ)=f(yi|β,σ2,bi)f(bi|D)f(ri|ϕ,yi),

whereas pattern-mixture models employ the reverse factorization

f(yi,bi,ri|β,σ2,D,ϕ)=f(yi|β,σ2,bi,ri)f(bi|D)f(ri|ϕ).

The term “pattern-mixture” emphasizes that the marginal distribution of y=(y1′,…,yN′)′ is a mixture of pattern-specific distributions. Most estimation methods assume that the distribution of ri depends on (yi, Xi, Zi) but not on bi. This assumption will be addressed in the discussion of models for the missing data mechanism.

3.2.1 Selection models

Estimation

where γ = (β, σ2, D, ϕ). Estimation of (β, σ2, D) is usually of interest with often, but not always, both the random effects and ϕ being viewed as nuisance parameters. Diggle and Kenward (1994) discuss estimation methods for selection models assuming monotone missing data. However, these methods are not easily extended to the analysis of nonmonotone missing data, where a subject may be observed after a missing value occurs. The method described next, based on the so-called EM by Method of Weights (Ibrahim 1990), is general in that it applies to both monotone and nonmonotone missing data.

For ease of exposition, write yi = (ymis,i, yobs,i), where ymis,i is the si × 1 vector of missing components of yi. The Monte Carlo EM (MCEM) algorithm has been used for parametric estimation in selection models with nonignorable missing response data (Ibrahim et al. 2001). The E-step consists of calculating the expected value of the complete data log-likelihood given the observed data and current parameter estimates. Since both bi and ymis,i are unobserved, they must be integrated over. Thus, the E-step for the ith observation at the (t + 1)st iteration is

Finally, for I3, bi can be easily integrated out since log[f (ri|ϕ, yi)] does not depend on bi. Therefore, I3 can be written simply as

I3=∫log[f(ri|ϕ,yi)]f(ymis,i|yobs,i,ri,γ(t))dymis,i.

(11)

The E-step, expressed via (9), (10), and (11) does not involve bi. Thus, to complete the E-step, we merely need to sample from [ymis,i|yobs,i, ri, γ(t)]. This distribution can be written, up to a constant of proportionality, as

which has the form of a normal density times a logistic regression for the ri’s. Thus, the distribution is from the class of concave log-densities, and Gibbs sampling from (12) is straightforward, using the adaptive rejection algorithm of Gilks and Wild (1992).

Precisely, the procedure is as follows. Let ui1, …, uimi be a sample of size mi from [ymis,i|yobs,i, ri, γ(t)], obtained via the Gibbs sampler along with the adaptive rejection algorithm of Gilks and Wild (1992). Also, let yi(k)=(uik′,yobs,i′)′ and

bi(tk)=Σi(t)Zi′(yi(k)−Xiβ(t))/σ2(t).

Then, the E-step for the ith observation at the (t + 1)th iteration takes the form

Models for the missing data mechanism

Diggle and Kenward (1994) proposed a binomial model for the missing data mechanism under the selection modeling approach, i.e.,

f(r|ϕ,y)=∏i=1N∏j=1ni[P(rij=1|ϕ)]rij[(1−P(rij=1|ϕ))]1−rij,

where P(rij = 1|ϕ) is modeled via a logistic regression involving all of the previous outcomes and the current outcome. This model takes the form

logit(P(rij=1|ϕ))≡log[P(rij=1|ϕ)1−P(rij=1|ϕ)]=ϕ0+ϕ1yij+∑k=2jϕkyj+1−k

for i = 1, …, N and j = 1, …, ni. The model can be extended to permit possible relationships between the missing data process and covariates, including time, by making ϕ0 a function of the covariates xqj at time tj. A linear function in the covariates could be written as

ϕ0=∑q=1rϕq0xqj.

(17)

For example, for the IBCSG data, consider a logistic regression model that includes the previous and current outcomes and treatment as covariates. Such a choice would specialize (17) to

logit(P(rij=1|ϕ))=ϕ0+ϕ1yi,j−1+ϕ2yij+ϕ3trtAi+ϕ4trtBi+ϕ5trtCi

for i = 1, …, N, j = 1, …, ni, and trtTi an indicator variable for whether subject i receives treatment T = A, B, C. Note that these models assume independence between the rij’s, in line with their conditional interpretation as probabilities of dropout given one is still at risk for dropping out.

A more general multinomial missing data model which incorporates a general correlation structure can be constructed by specifying the joint distribution of ri = (ri1, …, rini) through a sequence of one-dimensional conditional distributions (Ibrahim et al. 2001). Consider

where ϕa is a vector of indexing parameters for the ath conditional distribution and ϕ=(ϕn1′,…,ϕnN′)′. Thus,

f(r|ϕ,y)=∏i=1Nf(ri1,…,rini|ϕ,yi),

where f(ri1, …, rini|ϕ, yi) is given in (18). Since rij is binary, a sequence of logistic regressions can be used for (18). This modeling strategy has the potential of reducing the number of nuisance parameters that have to be specified for the missing data mechanism, yields general correlation structures between the rij’s, and allows more flexibility in the specification of the missing data model. It also accommodates nonmonotone patterns of missing data and provides a natural way to specify the joint distribution of the missing data indicators when knowledge about the missingness of one response affects the probability of missingness of another. One must be careful not to build a too large model for the missing data mechanism, since the model can easily become unidentifiable. Thus, caution should be taken when adding interaction terms or other higher-order terms. It is not clear how to characterize the set of all estimable parameters for this class of models given a certain choice of variables in the missing data mechanism. The parametric form of the assumed missing data mechanism is not testable from the data. Therefore, although a model may pass the tests for a certain missing data mechanism, this does not mean that one has captured the correct, and perhaps more complicated, missing data mechanism.

Also, it has been assumed throughout that [ri|ϕ, yi] does not depend on bi. This is a reasonable assumption in practice since autoregressive models for [ri|ϕ, yi] can closely approximate models for the missing data mechanism that include the random effect bi. In other words, conditional on the outcome vector yi, which contains information on the trajectory of the outcome over time, ri is independent of bi. In addition, the inclusion of a random effect in the missing data model induces a correlation structure across subjects in the marginal model [ri|ϕ, yi]. Note, however, that the correlation structure induced via a sequence of conditional distributions for [ri|ϕ, yi] as in (18) would also provide a suitable approximation to a correlation structure induced from a random effects model for the missing data mechanism. Little (1995) suggests using a model where missingness depends on the values of the random coefficients when the probability of missingness depends on current and past values of some latent variable that the outcome variable is measuring with error. However, including a random effect in [ri|ϕ, yi] makes the E-step substantially more computationally intensive, and all closed forms would be lost. A plausible alternative to the assumption, as suggested by Little (1995), is to model the missing data mechanism using the expected values of yi, rather than the actual values. In this case, (18) would then be written as

where γ = (β, σ2, D, ϕ). Since the distribution of yi depends on ri, a model based on this factorization implies that the marginal distribution of yi is a mixture of normal distributions rather than a single normal distribution as in the selection model. By conditioning on ri, this approach essentially stratifies the sample by the observed pattern of missing data and then models different distributions of yi over these patterns. Stratifying on the pattern is not always the most obvious way to go forward, as substantive interest usually concerns the mean and covariance matrix of yi averaged over pattern. However, it will be shown that inference for such parameters is not precluded in pattern-mixture models.

under the pattern-mixture factorization. The specification of distinct fixed parameters, (β(k), σ2(k)), creates major identification problems for each pattern because not all parameters of the complete data distribution of yi are estimable from incomplete pattern data. However, assumptions about the missing data mechanism can yield additional restrictions that help to identify the models, while avoiding explicit specification of the mechanism’s parametric form, such as required in the selection model approach. See also Verbeke and Molenberghs (2000) and Molenberghs and Verbeke (2005) for reviews.

To illustrate this idea, consider the analysis presented in Little and Wang (1996), in which pattern-mixture models are developed for a multivariate multiple regression. Suppose that we have a sample of N independent observations on p continuous outcome variables and q covariates, so that yi = (y1, …, yp)′ and xi = (x1, …, xq)′. Assume that xi and a subset of p1 rows of yi, denoted by y(1)i = (y1, …, yp1)′, are observed for all N cases and that the remaining p2 = (p − p1) rows of yi, denoted y(2)i = (yp1+1, …, yp)′, are observed for N0 cases and are missing for N1 = N − N0 cases. The indicator variable r is defined for observation i as ri = 0 if y(2)i is observed and ri = 1 if y(2)i is missing. Thus, we have a monotone missing data structure which can be found in longitudinal studies where subjects are lost to follow-up at the same time point. Now, pattern-mixture models are developed for this type of data using the model

where yi is p × 1, xi is a q × 1 vector of known covariates, β(k) is a p × q coefficient matrix of unknown regression parameters for pattern k, Σ(k) is a p × p unknown covariance matrix for pattern k, ri is an indicator variable for missingness, and ϕ is a q × 1 vector of unknown logistic regression parameters. Therefore, the total number of parameters to be estimated is 2pq + p(p + 1) + q. If we let θ(k) = (β(k), Σ(k)), k = 0, 1, and θ = (θ(0), θ(1)), then ϕ is distinct from θ and is estimated by standard methods for logistic regression of ri on xi. Note that the parameters of θ(1) cannot be directly estimated due to the missing data. However, these parameters can be identified by exploiting assumptions about the missing data mechanism. It should be noted that this model is more restrictive than the normal random-effects model of Laird and Ware (1982), which permits a distinct design matrix for each response and can incorporate random effects. It only encompasses models for repeated-measures data where the means are modeled as functions of between-subject covariates. Little (1995) considers random-effects models but does not give any details as to how pattern-mixture models would be developed.

The important step in developing pattern-mixture models is in making an assumption about the missing data mechanism. Suppose that

P(ri=1|y(1)i,y(2)i,xi)=g(y(2)i,xi),

(20)

where g is an arbitrary function of y(2)i and xi. Since missingness depends on the value of the missing variable, y(2)i, this is a nonignorable missing data mechanism. This assumption can then be converted into constraints on the parameters by factorizing the distribution of yi = (y(1)i, y(2)i) in pattern k as

where θ2(k)=(βx:2·x(k),Σ22·x(k)) consists of the (p2 × q) regression coefficient matrix and (p2 × p2) residual covariance matrix for the regression of y(2)i on xi within pattern k, and θ1·2(k)=(β2:1·2x(k),βx:1·2x(k),Σ11·2x(k)) consists of the (p1 × p2) and (p1 × q) regression coefficient matrices and (p1 × p1) residual covariance matrix for the regression of y(1)i on y(2)i and xi within pattern k.

Note that assumption (20) states that [ri|y(1)i, y(2)i, xi] y(1)i, which implies in turn that [y(1)i|y(2)i, xi, ri] ri, where indicates independence. In other words, the conditional distribution of y(1)i given y(2)i and xi is the same for both patterns, so that

θ1·2(0)=θ1·2(1)=θ1·2.

(21)

This yields p1(p2+q)+p1(p1+1)2 restrictions that help to identify the model, and likelihood inference now depends on the relative sizes of p1 and p2.

If p1 = p2, then the number of restrictions in (21) equals the number of unidentified parameters, and the model is just identified. Maximum likelihood (ML) estimates for the identified parameters are obtained by standard complete data methods, namely two multivariate regressions of y(1) on x and one multivariate regression of y(2) on y(1) and x. For example,

β^x:1·x(0)=(X′X)−1X′Y,Σ^11·x(0)=1N0(Y−Xβ^x:1·x(0))′(Y−Xβ^x:1·x(0)),

where Y is an N0 × p1 matrix of responses, and X is an N0 × q matrix of covariates. The estimates of interest, however, are from [y(1)i, y(2)i|xi], averaged over patterns, (βx:1·x, Σ11·x, βx:2·x, Σ22·x, Σ21·x). These can be expressed as functions of the identified parameters and ϕ by applying the identifying restrictions. The following ML estimates are then obtained by substituting the ML estimates of the identified parameters and ϕ into these functions:

where x = P(ri = 1|, xi). A modification of these equations is required if the resulting covariance matrices are not positive semidefinite. Specifically, if Σ^11·x−Σ^11·x(0) is not positive semidefinite, it is replaced by P Q P′, where P is the orthogonal matrix of eigenvectors of (Σ^11·x−Σ^11·x(0)), and Q is the diagonal matrix of eigenvalues of (Σ^11·x−Σ^11·x(0)) with the negative elements replaced by zero.

If p1 > p2, then the number of restrictions in (21) exceeds the number of unidentified parameters, and the model is overidentified. Explicit ML estimates cannot be obtained, and an iterative method such as the EM algorithm is required. The complete data log-likelihood of θ is

The E-step at each iteration replaces these statistics by their expected values given the observed data and current parameter estimates, which can be calculated from the first and second moments of y(2)i:

The M-step computes new parameter estimates by a complete-data maximization subject to the constraints induced by the missing data assumption. Therefore, for the restrictions of (21), the likelihood function for the complete data is rewritten as

Note that the E-step requires the regression of y(2)i on y(1)i and xi for the pattern with missing data, whereas the M-step requires the regression of y(2)i on xi for each pattern and y(1)i on y(2)i and xi pooled over patterns. The sweep operator (Little and Rubin 2002, Chap. 6) facilitates the switching of the regressions needed for the E- and M-steps. Specifically, θ2·1(1)=(β1:2·1x(1),βx:2·1x(1),Σ22·1x(1)) are obtained by sweeping on the second and third blocks of the matrix

D=[D11−1βx:1·2x′D12−1βx:1·2xΣ11·2xβ2:1·2xD12−1β2:1·2x′D22−1],

where

[D11D12D21D22]=−N1[Sxx(1)Sx2(1)′S2x(1)S22(1)]−1.

Note that the elements of D are calculated using the parameter estimates from the previous M-step. Once the E-step is completed, the missing parts of the following matrix, A1, can be filled in, leading to

A1=1N1[Sxx(1)Sx1(1)Sx2(1)Sx1(1)′S11(1)S21(1)′Sx2(1)′S12(1)S22(1)].

If we let

A0=1N0[Sxx(0)Sx1(0)Sx2(0)Sx1(0)′S11(0)S21(0)′Sx2(0)′S12(0)S22(0)]

and A = (N0A0 + N1A1)/N, then the M-step is completed by sweeping on the first block of A0 and A1 to obtain θ2(0)=(βx:2·x(0),Σ22·x(0)) and θ2(1)=(βx:2·x(1),Σ22·x(1)), respectively. The values of θ1·2 = (β2:1·2x, βx:1·2x, Σ11·2x) are obtained by sweeping on the first and third blocks of A. Notice that θ2(0) is not affected by the E-step and so does not need iteration. This process yields ML estimates of (βx:2·x(0),Σ22·x(0),βx:2·x(1),Σ22·x(1)) ML estimates of (βx:1·x(0),Σ11·x(0),βx:1·x(1),Σ11·x(1)) can be obtained from the regression of y(1)i on xi for each pattern, and ML estimates of ϕ can be obtained from a logistic regression of ri on xi. The following functions of these ML estimates yield ML estimates of the parameters of interest:

If p1 < p2, then the model remains underidentified, and additional restrictive assumptions are needed to identify the model parameters. Little and Wang (1996) suggest assuming

P(ri=1|y(1)i,y(2)i,xi)=g(y(2s)i,xi),

where y(2s)i is a subset of the variables y(2)i with dimension p(2s) ≤ p1. Using this approach, inference follows directly from the two scenarios previously described (p1 = p2 case when p1 = p(2s) and p1 > p2 case when p1 > p(2s)). The choice of subset variables is important to the success of the model, and reasons for dropout should be determined.

3.2.3 Discussion of selection and pattern-mixture models

All likelihood-based methods for handling nonignorable missing data must make some unverifiable assumptions, since the missing data mechanism included in the model depends on unobserved responses. Such a model is essentially nonidentifiable unless some unverifiable constraints are imposed. Inferences are only possible once these assumptions have been made, and the following aspects of the model need to be carefully considered: the bias and efficiency of parameter estimates, sensitivity to model specification, computational expense, and ease of implementation and interpretation. Selection and pattern-mixture models represent two different methods for handling nonignorable missing longitudinal data; each has its advantages and disadvantages.

Selection models directly model the distribution of primary interest, that is, the marginal distribution of the longitudinal outcomes. Thus, this method is more intuitive to most investigators. Selection models allow for a more natural way to model the missing data process, and since the missing data mechanism is modeled conditional on the repeated outcomes, it is very easy to formulate hypotheses about the missing data mechanism. However, to ensure identifiability, the set of outcomes is usually restricted in some way, and arbitrary constraints must be applied to the missing data model. It is unclear how these restrictions on the missing data mechanism translate into assumptions about the distribution of the unobserved outcomes. Sensitivity of parameter estimates to model assumptions need to be considered, as well as the complexity of the computational algorithms required to fit the models.

Pattern-mixture models make specific assumptions about the distribution of the unobserved outcomes, and therefore, it may be easier to explore the sensitivity of results to model specification. By modeling the outcomes separately for each pattern, problems of identifiability are made explicit. Model identifiability is more obscure in the selection modeling approach, and in this case, one needs to characterize identifiability theoretically. Chen et al. (2004a, 2006, 2009) have carried out such investigations. The main drawback of pattern-mixture models is that the parameters of interest are not immediately available. The primary focus of inference is on the marginal distribution of the outcomes, which can only be obtained by averaging over patterns. Hence, one cannot examine the effects of the individual covariates on the marginal distribution of the outcomes in terms of the regression coefficients. Also, as shown in the previous section, the computations needed for a simple multivariate multiple regression with just one pattern of missing data are complex. It is possible that pattern-mixture models may be computationally intractable for random-effects models or more general patterns of incomplete data.

3.2.4 Conditional linear models

Several methods have been proposed for dealing with series of measurements that may be right censored due to death or withdrawal. The right censoring is termed informative if the censoring probabilities depend on an individual subject’s underlying rate of change (slope) of the outcome variable. Thus, informative censoring is a special type of nonignorable missing data, and the class of joint models for longitudinal data and a nonignorable censoring process represent a specific case of the selection model. Wu and Carroll (1988) combine the normal random effects model with a probit model for the censoring process. They derive pseudo-maximum likelihood estimates and refer to their procedure as probit pseudo-maximum likelihood estimation (PPMLE). Wu and Bailey (1989) prove that under the probit model, the expectation of the slope for subject i is a monotonic increasing (decreasing) function of the censoring time, and instead of modeling the censoring process, they propose a conditional linear model for the individual least squares estimated slope. This method can be described as an approximation to account for the informative right censoring when estimating and comparing changes of a continuous outcome variable.

Consider the following general framework. Assume that in a longitudinal study, n measurements on the outcome variable are planned to be made for each participant and that the participants are to be allocated into two equal sized treatment groups. Let yi = (yi1, …, yiji) be the observed outcome vector of serial measurements for subject i, where ji ≤ n. The repeated measurements of yi are assumed to follow linear functions of time with normally distributed errors. Let βi = (βi0, βi1)′ be the unobserved vector representing the true intercept and slope of the outcome variable for the ith subject, and let (i|ji) = i be the usual least squares estimate of βi based on the ji observations. Furthermore, assume that when the ith subject belongs to the kth group, k = 1, 2, i follows a bivariate normal distribution. Thus

yi=Xiβi+ei,

where

ei~Nni(0,σ2Iji),βi~N2(βk,D),

and

β^i|ji=(Xi′Xi)−1Xi′yi,

where

Xi′=[1⋯1t1⋯tji].

The conditional linear model approach writes the slope as a linear function of the censoring time with normal errors. Specifically,

(β^i1|ji=j)=γ0k+γ1tj+ekj,

(23)

where E(ekj) = 0 and Var(ekj)=σkj2. Two methods to estimate the expected slopes, βk1, were proposed by Wu and Bailey (1988, 1989). The linear minimum variance unbiased (LMVUB) procedure estimates γ0k and γ1 by weighted least squares so that

LMVUB(βk1)=γ^0k+γ^1Eik(tji),

where Eik(tji) is the expected value of the censoring time for the kth group (i.e., the sample mean for the kth group). The linear minimum mean squared error (LMMSE) estimate is a linear combination of the individual least squares slope estimates with the weights, Wkj, chosen to minimize the mean squared error under the linear model of (23) so that

LMMSE(βk1)=∑j=2nWkj(β^¯k1),

where

β^¯k1=∑iεk(β^i|ji=j)nkj,

with nkj denoting the number of subjects censored after j measurements were taken in the kth group. Wu and Bailey (1988) review PPMLE, LMVUB, and LMMSE and compare these approaches together with the weighted and unweighted least squares estimates in the presence of informative censoring. Schluchter (1992) proposes a log-normal survival model which is a generalization of the conditional linear model that allows staggered patient entry and uses the exact censoring times of each individual.

4 Generalized linear mixed models

The generalized linear mixed model (GLMM) is the generalized linear model (GLM) extension of the normal linear random effects model. It is defined as follows. Suppose that the sampling distribution of yij, j = 1, …, ni, i = 1, …, N, is from an exponential family, so that

f(yij|θij,τ)=exp{τ[yijθij−a(θij)]+c(yij,τ)},

(24)

where τ is a scalar dispersion parameter. Except for the normal random effects model, it will be assumed that τ = τ0, where τ0 is known, since τ0 = 1 in the logistic and Poisson regression models. The yij’s are assumed to be independent given the random effects, and each yij has canonical parameter θij, which is related to the covariates by θ(ηij), where ηij=xij′β+zij′bi, and xij′ is a 1 × p vector denoting the jth row of Xi, while zij′ is a 1 × q vector denoting the jth row of Zi. The link function, θ(·), is a monotonic differentiable function. When θij = ηij, the link is said to be canonical. Note that the GLMM has similarity with the normal random effects model in that we assume that conditional on the random effects, bi, the repeated observations on subject i are independent. Letting y = (y11, …, yNnN)′, X=(X1′,…,XN′)′,Z=diag(Z1,…,ZN), and b=(b1′,…,bN′)′, the full likelihood based on N subjects for the GLMM is given by

f(y,b|β,D)=∏i=1N∏j=1nif(yij|β,bi)f(bi|D),

where f(bi|D) is the distribution of bi. As usual, it is assumed that bi ~ Nq(0, D), so that

f(bi|D)=(2π)−q/2|D|−1/2exp{−12bi′D−1bi}.

To induce a correlation structure on the responses, inference is based on the marginal likelihood of β and D with the random effects integrated out. This is given by

f(y|β,D)=∫RNqf(y,b|β,D)db,

(25)

where RNq denotes the Nq dimensional Euclidean space.

4.1 Complete-data estimation

If y and X are fully observed, then the likelihood function based on the observed data is given by (25). Note that

Thus, the marginal likelihood involves evaluating Nq-dimensional integrals. For the general class of GLMM’s, these integrals do not have a closed form and are very difficult to evaluate. This problem led to the development of quasi-likelihood based methods. Quasi-likelihood was first introduced for the generalized linear model by Wedderburn (1974), who defined the quasi-likelihood function as follows. Suppose that yi, i = 1, …, N, is a set of observations with expectation E(yi|β) = μi and variance Var(yi|β) = a(τ)V(µi), where V(µi) is some known function. The quasi-likelihood function, Q(yi, μi), is defined by the relation

∂Q(yi,μi)∂μi=yi−μiV(μi).

The log-likelihood is a special case of the quasi-likelihood function, but Wedderburn (1974) showed that one can use any function Q(yi, µi) that satisfies the above definition as a basis for defining a GLM and obtaining estimates of the β’s. In other words, GLM’s can be used for any random variable as long as the mean, the mean function, the variance function, and the scale parameter are known.

In the GLMM, the conditional distribution of [y|β, b] plays the same role as the distribution of [y|β] in the fixed-effects GLM, and the joint quasi-likelihood function is the sum of the quasi-likelihoods of [y|β, b] and [b|D]. Since inference is based on the marginal likelihood of β and D with the random effects integrated out, an integrated quasi-likelihood function is used to estimate θ = (β, D). This is defined by

exp(Q(y,μ|b))∝|IN⊗D|−1∫exp{(−12∑i=1Ndevi)−12b′(IN⊗D)−1b}db,

where devi denotes the deviance measure of fit for subject i, b is the Nq × 1 vector of the bi’s, IND is the Nq × Nq covariance matrix of b, and the scalar dispersion parameter is assumed to equal one. Breslow and Clayton (1993) apply Laplace’s method to approximate this function and show that

Q(y,μ|b)≈[y′θ(η)−J′a(θ(η))]−12b′(IN⊗D)−1b,

where y is the n×1(n=∑i=1Nni) vector of the yij’s, θ(η) is the n × 1 vector of the θ(ηij)’s, J is a column vector of ones, and a(θ(η)) is the n × 1 vector of the a(θ(ηij))’s. Differentiation with respect to β and b leads to score equations for these parameters, and solutions can be obtained via Fisher scoring by iteratively solving

[X′WXX′WZZ′WXZ′WZ+(IN⊗D)−1][βb]=[X′WY*Z′WY*],

where W = G R−1G, Y* = + (y − )G−1, G=dμdη, and R = Var(y|β, b). Substitution of and into the approximated quasi-likelihood function and evaluation of W at and generate an approximate profile quasi-likelihood function for inference on D. Breslow and Clayton (1993) show that differentiating a REML version of this function with respect to the components of D yields the following estimating equations for the variance parameters:

−12[(Y*−Xβ)′V−1∂V∂DijV−1(Y*−Xβ)−tr(P∂V∂Dij)]=0,

where V = W−1 + Z(IND)Z′ and P = V−1 − V−1X[(X′V−1X)−1X′V−1]. Breslow and Clayton (1993) call their procedure penalized quasi-likelihood (PQL) and assume that the scale parameter τ equals one. Wolfinger and O’Connell (1993) developed a refinement of PQL called pseudo-likelihood (PL), which assumes that τ is unknown, and PQL is simply a special case of PL when τ = 1. The method is implemented in the SAS procedure GLIMMIX, which recently has been augmented with the Laplace approximation and numerical quadrature as well. Other software packages, such as R and MLwiN, have functions and procedures for PQL estimation too, such as the R function glmmPQL. Note that MLwiN allows for second-order PQL.

Alternatively, indeed, numerical integration methods have been proposed, based on so-called nonadaptive or adaptive Gaussian quadrature. The first of these methods implements a conventional quadrature rule. The second one makes use of the bell-shaped form of the conditional likelihood function, focusing attention on the portion with highest mass. While more accurate than the PQL and PL methods, numerical integration can be computationally intensive and very sensitive to starting values. It has been implemented in the SAS procedures GLIMMIX and NLMIXED.

4.2 Estimation with nonignorable missing response data

When some components of y are nonignorably missing, the estimation problem based on the observed data likelihood in (26) becomes more complicated since another integral over the missing data and the missing data mechanism would be introduced. Ibrahim et al. (2001) have developed a Monte Carlo EM algorithm for the selection model that facilitates straightforward estimation of β and D. Less work has been done in estimating parameters for the GLMM with nonignorable missing data using a pattern-mixture modeling approach. Fitzmaurice and Laird (2000) propose a method based on generalized estimating equations (Liang and Zeger 1986), but theirs rather is an extension of Wu and Bailey’s conditional linear model (Wu and Bailey 1988, 1989) than a pattern-mixture model as described by Little (1993, 1995).

4.2.1 Selection models

Recall that the complete data log-likelihood for the selection model is given by (5), where now f(yi|β, bi) is the GLMM given in (24). Assume that yi contains arbitrary and nonmonotone patterns of missingness so that some permutation of the indices of yi can be written as yi = (ymis,i, yobs,i). Ibrahim et al. (2001) use the Monte Carlo version of the EM algorithm for parameter estimation in the GLMM selection model with nonignorable missing response data. They write the E-step for an arbitrary GLMM in a weighted complete data form by using the general form of the EM by the Method of Weights (Ibrahim 1990). Recall further that the E-step for the ith observation at the (t + 1)st iteration can be written as (6), where γ(t) = (β(t), D(t), ϕ(t)), and f(ymis,i, bi|yobs,i, ri, γ(t)) represents the conditional distribution of the “missing” data, (ymis,i, bi), given the observed data. The Monte Carlo EM algorithm given by Wei and Tanner (1990) requires generating a sample from

[ymis,i,bi|yobs,i,ri,γ(t)]

for each i. This can be done via the Gibbs sampler by sampling from the complete conditionals [ymis,i|yobs,i, bi, ri, γ(t)] and [bi|ymis,i, yobs,i, ri, γ(t)]. Note that

f(ymis,i|yobs,i,bi,ri,γ(t))∝f(yi|bi,γ(t))f(ri|yi,γ(t))

(27)

and

f(bi|ymis,i,yobs,i,ri,γ(t))∝f(yi|bi,γ(t))f(bi|γ(t)).

(28)

The products on the right side of (27) and (28) are log-concave for the class of GLMM’s in (24). This is true since f(yi|bi, γ(t)) is log-concave in the components of yi and f(ri|yi, γ(t)) will be log-concave in the yi ’s if each [ri|yi, γ(t)] is taken to be a logistic regression model. Also, f(yi|bi, γ(t)) and f(bi|γ(t)) are both log-concave in the components of bi. Since the sum of the logarithms of log-concave densities is a concave function, the Gibbs sampler along with the adaptive rejection algorithm of Gilks and Wild (1992) can be used to sample from

Suppose that for the ith observation, a sample of size mi, υi1, …, υimi, is taken from the joint distribution of [ymis,i, bi|yobs,i, ri, γ(t)] via the Gibbs sampler described by (29) and (30) in conjunction with the adaptive rejection algorithm as discussed above. Note that each υik will be an (si + q) × 1 vector for k = 1, …, mi and that each υik depends on the iteration number which is suppressed. The E-step for the ith observation at the (t + 1)st iteration can now be written as

Note that this E-step takes a complete data weighted form in which each (ymis,i, bi) gets filled in by a set of mi values, each contributing a weight of 1/mi. The E-step for all of the observations is given by

Q(γ|γ(t))=∑i=1N∑k=1mi1mil(γ;yobs,i,υik,ri).

The resulting M-step is like one of complete data for the GLMM and can be obtained as follows. Let

Q˙(γ|γ(t))=(Q˙(1)(β|γ(t)),Q˙(2)(D|γ(t)),Q˙(3)(ϕ|γ(t)))′

denote the score vector of Q(γ|γ(t)) so that

Q˙(γ|γ(t))≡∑i=1NQ˙i(γ|γ(t))=∑i=1N∑k=1mi1mi∂l(γ;yobs,i,υik,ri)∂γ.

Also, let

Q¨(γ|γ(t))≡∂2Q(γ|γ(t))∂γ∂γ′

denote the Hessian matrix. Since β, D, and ϕ are distinct, derivatives of l(γ; yobs,i, υik, ri) are straightforward to compute, and (γ|γ(t)) is block diagonal in β, D, and ϕ. Computation of the asymptotic covariance matrix of can be done using Louis’s (1982) method. The estimated observed information matrix of γ based on the observed data is given by

The quantities in (32) are easily computed since both (|) and i(|) are obtained from the M-step and Si(; yobs,i, υik, ri) is easily obtained outside of the EM algorithm.

The method described here is valid for arbitrary patterns of missing data in the response variable. The complexity of the estimate of D in the M-step depends on the structure of D. In any case, the estimation of D corresponds to estimation from a problem of complete data, and one can use any existing complete data software to estimate D. Also, note that models for the missing data mechanism in GLMM’s are the same as the normal random-effects model.

4.2.2 Pattern-mixture models

Recall that pattern-mixture models stratify the incomplete data by the pattern of missing values and formulate distinct models within each stratum. Thus, the complete data log-likelihood is written as

l(γ)=∑i=1Nlog[f(yi|β,bi,ri)]+∑i=1Nlog[f(bi|D)]+∑i=1Nlog[f(ri|ϕ)].

Little work has been done using pattern-mixture models for GLMM’s with nonignorable missing data. Ekholm and Skinner (1998) analyze longitudinal binary data using a pattern-mixture model but do not generalize their method to the GLMM. Fitzmaurice and Laird (2000) develop a model for the GLMM with nonignorable dropout, which they consider to be a mixture model based on Wu and Bailey’s conditional linear model (Wu and Bailey 1988, 1989), since dropout time is used as a covariate. This is the method that will be described here.

Consider the following notation. Assume that N subjects are to be observed at the same set of n occasions, {t1, …, tn}. Let yic=(yi1,…,yin)′ denote the complete response vector for subject i, and let Xi denote the n × p matrix of covariates for yic. Each subject also has an event time, ri, denoting the dropout time, which is thought to be related to yic. Note that dropout implies that no subsequent repeated measures are made, so if ri ≤ tn, then the ith subject is a dropout. ri is considered to be discrete and occurring at tj+1 if the response at tj+1 is not observed. Let ϕij = P(ri = tj) and assume that ϕi1 = 0 for all i. An additional category, ϕi(n+1), is included for the completers. The observed data for each subject consists of (yi, Xi, ri).

Consider models for yi, conditional of the time of dropout, that are of the following general form:

g(E(yij|β,ri))=zij′β,

(33)

where g(·) is a known link function, and the design vector, zij, includes the dropout time, the covariates, and their interactions. The parameters in this model have an unappealing interpretation due to the stratification by pattern of dropout, which may depend on the outcome. Therefore, the parameter of interest is not β, but the marginal expectation of the repeated outcome averaged over the distribution of dropout times,

E(yij|β)=μij=∑l=2n+1ϕilg−1(zij′β),

where zij includes the dropout time and xij, and ϕil depends on Xi. Since this estimate has been averaged over the distribution of the dropout times, the marginal mean will not, in general, follow the link function model assumed in (33). Therefore, the zij should be saturated in any covariate effects of interest so that comparisons can be made in terms of the marginal means.

Unlike the normal random effects model, it is difficult to account for the covariance among the repeated outcomes when the response variable is categorical, ordinal, or count data. Generalized estimating equations (GEE’s) (see Liang and Zeger 1986; Zeger and Liang 1986) represent a general method for incorporating within-subject correlation in the GLM without having to completely specify the joint distribution of yi. Only the forms of the first and second moments are required. Note that the GEE approach can accommodate any intermediate MCAR missingness in the outcome since each subject is allowed a distinct set of measurement times. The estimating equations for β with nonignorable missing data are given by

U(β)=∑i=1NGi′Vi−1[yi−E(yi|β,ri)]=0,

where yi is the ni × 1 vector of observed responses, Gi=∂E(yi|β,ri)∂β, and Vi is the ni × ni working covariance matrix of yi. Note that Vi depends on the marginal means, E(yij|β, ri), and a set of association parameters, ρ. Typically ρ is unknown and can be estimated with another set of estimating equations. It can be shown that N1/2( − β) has an asymptotic normal distribution with mean 0 and covariance matrix

Estimation of the dropout probabilities also needs to be considered. With a small number of discrete covariates, the dropout probabilities, ϕij, can be estimated as the sample proportion with each dropout time stratified by covariate pattern. The asymptotic covariance matrix of N1/2( − ϕ) is then given by

Vϕ^=diag(ϕ)−ϕϕ′.

When the number of dropout times or covariates is large, then parametric models such as a multinomial log-linear regression model can be used to estimate ϕ.

The appealing aspect of the mixture model presented above is that the SAS procedure GENMOD, or any other statistical software for GEE’s, can be used to estimate β. The dropout times and their interactions with the other covariates are simply included as additional covariates in the model. The marginal means at times tj can then be estimated by

where kt(Xi; β) is a known smooth function of β, t = 1, …, T, and β is a p × 1 vector of unknown parameters on which inferences are to be made. Note that this model places no restrictions on the conditional mean of yit given Xi at any time t and so is referred to as nonparametric.

Consider the following notation. Let υit be a vector of time-dependent covariates that are not of interest. Define wit=(υit′,yit)′,t=1,…,T,wi0=(xi0′,υi0′,yi0)′,wi=(wi0′,wi1′,…,wiT′)′, and let rit be an indicator variable for time t, t = 1, …, T, that takes the value 1 if wit is observed and 0 otherwise. Let πi(1) = P(ri = 1′|wi) be the conditional probability of observing the full data, wi, for the ith subject given wi. In addition, suppose that given wi, ri is a vector of possibly correlated binary variables taking values in the set {r = (r1, …, rT)′ : r = 0 or 1, 1 ≤ t ≤ T}. Letting it = (ri1, …, ri(t−1))′ and defining i1 = 1, the conditional distribution of ri given wi is

where d(1)(Xi; β) and d(2)(Xi; β) are p × T and q × T fixed functions of Xi and β, and Ai(1)(α) and Ai(2)(α) are defined as

Ai(j)(α)=∑r≠1[I(ri=r)−I(ri=1′)πi(1;α)πi(r;α)]fr(j)(wri)

such that

∑i=1NAi(j)(α)=0.

Rotnitzky et al. (1998) show that the solution to these equations is consistent and asymptotically normally distributed, provided that the conditional mean model and the model for the response probabilities are correctly specified. The variance of depends on the choice of functions d(j)(Xi; β) and fr(j)(wri), and optimal choices for these functions are discussed in the paper. This approach is extended to semiparametric models for the dropout mechanism by Scharfstein et al. (1999). Lipsitz et al. (1999b) examine theoretical connections between WEE and ML methods.

which has the form of a normal density times a logistic regression for the ri’s times some sort of regression for the Xmis,i’s. If this distribution is from the class of concave log-densities, then Gibbs sampling from (34) is straightforward using the adaptive rejection algorithm of Gilks and Wild (1992).

This methodology has been extended to the GLMM by Stubbendick and Ibrahim (2006). Thus, an MCEM sample must be generated from [ymis,i, Xmis,i, bi|yobs,i, Xobs,i, ri, γ(t)] for each i. This can be done using the Gibbs sampler by sampling from the complete conditionals, [ymis,i|yobs,i, Xmis,i, Xobs,i, bi, ri, γ(t)], [Xmis,i|ymis,i, yobs,i, Xobs,i, bi, ri, γ(t)], and [bi|ymis,i, yobs,i, Xmis,i, Xobs,i, ri, γ(t)]. Note that

When the products on the right-hand side of (35)–(37) are log-concave for the class of GLMMs, then the Gibbs sampler along with adaptive rejection algorithm of Gilks and Wild (1992) can be used to sample from the complete conditionals.

Allowing for nonignorable missing responses and covariates presents several additional modeling and computational challenges compared to just the missing response situation. First, a covariate distribution needs to be specified and its parameters estimated. This is done by specifying the covariate distribution via a sequence of one-dimensional conditional distributions as

where xijm is the mth covariate for individual i at time j, αk is a vector of indexing parameters for the kth conditional distribution, α=(α1′,…,αp′)′, and the αk’s are distinct. Note that (38) only needs to be specified for those covariates that are missing. Second, identifiability of the model needs to be carefully considered. Third, efficient computational strategies are needed since this model can be computationally intensive in general.

5.1 Model assessment and sensitivity

Unfortunately, the parametric forms of the assumed missing data mechanism and the covariate model are not testable from the data. Many models need to be evaluated owing to the numerous possibilities for the missing data mechanism and/or the covariate distribution and for carrying out sensitivity analyses. In addition, issues related to bias, efficiency, and model fit need to be addressed. In the presence of missing data, Lipsitz et al. (2001) and Fitzmaurice et al. (2001) examine bias issues in longitudinal data. Chen et al. (2008) examine bias and efficiency issues in regression models with missing responses and/or covariates. To address general issues regarding model fit and assessment in the presence of missing data, new methods are needed for defining residuals, diagnostic measures, assessing model fit, and assessing the influence of model perturbations for all types of models, such as GLMs, survival models, and models for longitudinal data. This is a currently growing, active, and open research area. AIC and BIC are common model assessment tools under the frequentist paradigm. In the presence of missing data, the definition of the AIC/BIC criterion is not clear. Ibrahim et al. (2008b) define AIC as AIC = −2Q(|) + 2d, where d is the total number of parameters in the model and Q(|) is the Q function from the EM algorithm at convergence. Similarly, they define BIC as BIC = −2Q(|) + log(N)d. Such measures can be used to assess fit in models for longitudinal data.

A more general framework for model assessment in complete data problems is given in Cook (1986), where he describes a method for assessing the local influence of minor perturbations of a statistical model. His method uses the geometric normal curvature to characterize the behavior of an influence graph based on a well-behaved likelihood function. In the context of the linear mixed model with complete data, Beckman et al. (1987) use local influence to assess the effect of perturbing the error variances, the random-effects variances, and the response vector. Lesaffre and Verbeke (1998) show that the local influence approach is also useful for the detection of influential subjects in a longitudinal data analysis. Zhu and Lee (2001) apply Cook’s approach to the conditional expectation of the complete data log-likelihood function in the EM algorithm instead of the more complicated observed data log-likelihood function. Their Q-displacement function, 2[Q(|) − Q((ω)|)], was used in assessing the local influence of perturbations of selection models with nonignorable missing data in Shi et al. (2009). Zhu et al. (2009) examine residuals, diagnostic measures, and goodness of fit statistics for GLMs with missing covariate data. Shi et al. (2009) also examine local influence approaches for GLMs with missing covariate data, and Garcia et al. (2009) investigate variable selection in GLMs with missing covariate data using penalized likelihood approaches. These procedures are currently being extended to longitudinal data.

Broadly speaking, there are three main reasons to consider such models. First, a time-to-event outcome may be measured alongside a longitudinal covariate. Such a joint model then allows, in a natural way, for incorporation of measurement error present in the longitudinal covariate into the model. Second, a number of researchers have used joint modeling methods to exploit longitudinal markers as surrogates for survival. Tsiatis et al. (1995), for instance, propose a model for the relationship of survival to longitudinal data measured with error and, using Prentice’s (1989) criteria, examine whether CD4 counts may serve as a useful surrogate marker for survival in patients with AIDS. Xu and Zeger (2001) investigate the issue of evaluating multiple surrogate endpoints and discuss a joint latent model for a time to clinical event and for repeated measures over time on multiple biomarkers that are potential surrogates. In addition, they propose two complementary measures to assess the relative benefit of using multiple surrogates as opposed to a single one. Another aspect of the problem, discussed by Henderson et al. (2000), Brown and Ibrahim (2003a, 2003b), Ibrahim et al. (2004), Chen et al. (2004b), Brown et al. (2005), and Chi and Ibrahim (2006, 2007), is the identification of longitudinal markers for survival. These authors focus on the use of longitudinal marker trajectories to investigate the association between a longitudinal marker and survival. Renard et al. (2002) used a joint model to explore the usefulness of prostate-specific antigen as a marker for prostate cancer.

Third, and most relevant for us here, such joint models can be used when incomplete longitudinal data are collected. Whenever data are incomplete, one should a priori consider the joint distribution of the responses and missing data process. In this sense, selection models and pattern-mixture models are merely convenient ways to decompose this joint distribution. In a number of applications, it may be attractive to write this joint distribution in terms of latent variables, latent classes, or random effects. This leads to so-called shared-parameter models. In principle, one can augment the full-data distribution with random effects

f(yi,ri,bi|Xi,Wi,Zi,θ,ψ,ξ)

(39)

and then still consider the selection-model factorization

f(yi,ri,bi|Xi,Wi,Zi,θ,ψ)=f(yi|Xi,bi,θ)f(ri|yi,bi,Wi,ψ)f(bi|Zi,ξ)

(40)

and the pattern-mixture model factorization

f(yi,ri,bi|Xi,Wi,Zi,θ,ψ,ξ)=f(yi|ri,bi,Xi,θ)f(ri|bi,Wi,ψ)f(bi|Zi,ξ).

(41)

Here, Zi and ξ are covariates and parameters, respectively, describing the random-effects distribution. Little (1995) refers to such decompositions as random-coefficient selection and pattern-mixture models, respectively.

Important early references to such models are Wu and Carroll (1988) and Wu and Bailey (1988, 1989). Wu and Carroll (1988) proposed this kind of model for what they termed informative right censoring. For a continuous response, Wu and Carroll suggested using a conventional Gaussian random-coefficient model combined with an appropriate model for time to dropout, such as a proportional hazards, logistic, or probit regression. The combination of probit and Gaussian response allows an explicit solution of the integral and was used in their application.

In a slightly different approach to modeling dropout time as a continuous variable in the latent variable setting, Schluchter (1992) and DeGruttola and Tu (1994) proposed joint multivariate Gaussian distributions for the latent variable(s) of the response process and a variable representing time to dropout. The correlation between these variables induces dependence between dropout and response. To permit more realistic distributions for dropout time, Schluchter (1992) proposed that dropout time itself should be some monotone transformation of the corresponding Gaussian variable. The use of a joint Gaussian representation does simplify computational problems associated with the likelihood. There are clear links here with the Tobit model, and this is made explicit by Cowles et al. (1996), who use a number of correlated latent variables to represent various aspects of an individual’s behavior, such as compliance and attendance at scheduled visits. Models of this type handle nonmonotone missingness quite conveniently. There are many ways in which such models can be extended and generalized.

An important simplification arises when Yi and Ri are assumed independent, given the random effects. We then obtain the shared-parameter decomposition

f(yi,ri,bi|Xi,Wi,Zi,θ,ψ,ξ)=f(yi|Xi,bi,θ)f(ri|Wi,bi,ψ)f(bi|Zi,ξ).

(42)

This route was followed by Follman and Wu (1995). Note that, when bi is assumed to be discrete, a latent-class or mixture model follows. Rizopoulos et al. (2008) study the impact of random-effects misspecification in a shared-parameter model. Beunckens et al. (2008) combine continuous random effects with latent classes, leading to the simultaneous use of mixture and mixed-effects models ideas. It is very natural to handle random-coefficient models and in particular shared-parameter models in a Bayesian framework. Examples in the missing data setting are provided by Best et al. (1996) and Carpenter et al. (2002).

7 Bayesian methods

Daniels and Hogan (2008) provide a comprehensive survey of Bayesian methods for longitudinal models with missing data. We refer the reader to their book and the many references therein. Here, we only provide a brief general discussion of implementational and methodologic issues for the Bayesian paradigm in the presence of missing data. Fully Bayesian methods require specifying priors for all the parameters and specifying distributions for the missing covariates and/or missing data mechanisms, along with the sampling distribution of the response variable. We note here that Bayesian methods for any missing data problem are, in principal, quite straightforward to implement compared to the no missing data situation. This is due to the fact that all one needs to do in the Bayesian paradigm is to add additional steps to the Gibbs sampler, for example, to sample from the complete conditional distributions of the missing data. Such steps can be easily incorporated into an existing Gibbs sampler for a no missing data problem, and will be generally easier to implement than the MCEM algorithm discussed earlier. These fully Bayesian procedures can be easily implemented in WinBUGS or PROC MCMC in SAS for all types of models, including models for longitudinal data.

However, new issues arise in fitting Bayesian models with missing data that do not arise in the frequentist development. First, one has to ensure that the posterior distribution is proper when using improper priors, as it is very easy for the posterior to be improper especially in nonignorable missing data settings. These issues and other modeling and elicitation issues are discussed in Ibrahim et al. (2001, Chap. 8), Ibrahim et al. (2002, 2008a), and Chen et al. (2002, 2004a, 2006). Second, even when using proper priors, if the model is weakly identifiable, which is often the case in many nonignorable missing data problems, the inferences may be quite sensitive to the choices of the hyperparameters, and one needs clever strategies for specifying informative priors that do not dominate the likelihood. Such strategies are outlined in Huang et al. (2005) for GLMs that can be easily extended to models for longitudinal data. Thirdly, it is conceivable that fully Bayesian methods may be more computationally intensive than their frequentist counterparts and Markov chain Monte Carlo convergence may not be easily achieved.

Thanks to advances in terms of both available methodology and efficient implementations thereof, not in the least in generally available statistical software tools, such as SAS, SPSS, SPlus, WinBUGS and R, quite advanced analyses are within reach, and there is no longer a need to focus on such simplistic methods as a complete-case analysis or last observation carried forward, to name but a few. At the same time, all methods, no matter how sophisticated, rest to some extent on unverifiable assumptions, owing to the simple fact that the missing data are unobserved. Therefore, rather than placing belief in a single such model, it should be supplemented with appropriate forms of sensitivity analysis.