Suppose that researchers have conducted a longitudinal study following two (or more) groups of subjects and measuring a continuous response (or dependent or outcome) variable over time. Their typical primary aims involve testing whether the mean of the response changes over time for each of the groups, as well as whether such change differs between groups. One common example occurs in the context of an intervention study, in which response variable trajectories over time for the intervention and control groups are compared.

The classical approach for addressing this issue is to conduct a repeated measures analysis of variance (ANOVA) fitting a full factorial fixed effects model containing terms for the main effects of group and time as well as their interaction. Testing for a differential change over time between groups can be addressed through the group-by-time interaction effect. This model may seem like a reasonable choice, but it is important for researchers to understand that this choice involves assumptions that may be inappropriate for the data. Specifically, it assumes that variances are constant (or homogeneous) over time, when they might change over time instead, and that correlations between any two response measurements are the same, no matter how far apart they occur in time. This covariance structure is called compound symmetry (CS).

For longitudinal data, especially when collected over long time periods, it seems more natural for correlations to weaken as measurements occur further apart, which is reflected through autoregressive (AR) structures. Moreover, there are many other choices besides CS and AR for modeling correlations, together with either constant or heterogeneous variances. Furthermore, researchers should understand that the different choices for how the variances and correlations (or, equivalently, the covariance structure) are modeled can lead to critically different significance levels for hypothesis tests about means (e.g., Park, Park, & Davis, 2001 and the example given below) addressing their research aims. Linear mixed models (LMMs) can be used to generate these alternatives for the covariance structure, along with alternatives for the fixed effects (for modeling the mean of the response).
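To make the contrast concrete, the two correlation patterns can be sketched in a few lines of Python (an illustration only; the function names are ours):

```python
def cs_corr(n_times, rho):
    """Compound symmetry: the same correlation rho between every pair of times."""
    return [[1.0 if i == j else rho for j in range(n_times)]
            for i in range(n_times)]

def ar1_corr(n_times, rho):
    """First-order autoregressive: correlation rho ** |i - j| weakens with lag."""
    return [[rho ** abs(i - j) for j in range(n_times)]
            for i in range(n_times)]

# With rho = 0.8 and four equally spaced times, CS keeps every off-diagonal
# entry at 0.8, while AR(1) decays from 0.8 at lag 1 to 0.8 ** 3 (about 0.512)
# at lag 3.
R_cs = cs_corr(4, 0.8)
R_ar = ar1_corr(4, 0.8)
```

Under CS, measurements 1 occasion apart and 3 occasions apart are equally correlated; under AR, the latter correlation is distinctly smaller.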

Choosing a standard covariance model like CS or AR subjectively can lead to erroneous conclusions. We provide below an example where that is precisely the case. We describe a more objective approach, in which model selection criteria are used to make such decisions. As an added benefit, this approach to model selection is separate from tests of the fixed effects, as used to address the specific aims of a study, and so we can avoid bias for those results by keeping them masked until the final choice of the covariance structure is made.

Use of model selection criteria to identify an appropriate covariance structure should be standard practice. In this article, we review alternative model selection procedures. We describe strategies for choosing the most appropriate model and then demonstrate their use with an example.

Linear Mixed Modeling

Continuous response variables measured longitudinally over time can be analyzed with LMMs (e.g., see Brown & Prescott, 2006; Fitzmaurice, Laird, & Ware, 2004; Verbeke & Molenberghs, 2000). These are supported in popular statistical software tools like SAS (current version 9.3; SAS Institute, Inc., Cary, NC) and SPSS (current version 19.0; SPSS, Chicago, IL). Results are reported in what follows for only these two tools, but similar results will hold for other tools like Stata (StataCorp LP, College Station, TX) and R (The R Foundation for Statistical Computing, Vienna, Austria). The fixed effects of these models describe the mean response as a function of available predictors using a general linear model formulation. This formulation encompasses ANOVA, regression, and analysis of covariance (ANCOVA) as special cases. The random errors are treated as multivariate normally distributed (i.e., the generalization of the normal distribution to the multivariate response context) with zero means and a specified covariance structure. This covariance structure determines within-subject correlations and variances. It can be directly specified (e.g., CS), indirectly specified through random effects, or based on both of these.

Standard ANOVA, regression, and ANCOVA models assume that errors for all measurements are mutually independent, which is usually inappropriate for longitudinal data. Standard repeated measures ANOVA, regression, and ANCOVA models assume a CS structure, the simplest dependent case. Alternatively, standard multivariate ANOVA, regression, and ANCOVA models assume a completely general or unstructured (UN) covariance matrix, in which all variances and all distinct correlations are treated as separate parameters.

AR is an intermediate covariance alternative. Standard AR treats adjacent times as being equally spaced, even when they are not. Spatial AR, on the other hand, accounts for the actual timing of the responses and so adjusts for unequally spaced data. Spatial AR is currently supported in SAS but not in the current version of SPSS. CS, AR, and spatial AR each assume constant variances across time. Heterogeneous variances are currently supported for CS (CSH) and standard AR (ARH), but not for spatial AR.
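The distinction can be illustrated with a small Python sketch (our own; SAS refers to this spatial structure as spatial power, SP(POW)), in which correlations depend on the actual distance between measurement times:

```python
def spatial_ar_corr(times, rho):
    # Correlation decays with the actual distance between measurement times,
    # not with the number of intervening measurement occasions.
    return [[rho ** abs(ti - tj) for tj in times] for ti in times]

# Hypothetical times mostly 3 units apart, with a final 4-unit gap. Standard
# AR would assign all adjacent pairs the same correlation rho; spatial AR
# gives the final, longer gap a smaller correlation.
times = [0, 3, 6, 9, 12, 15, 19]
R = spatial_ar_corr(times, 0.9)
# R[4][5] = 0.9 ** 3 (times 12 and 15) exceeds R[5][6] = 0.9 ** 4 (15 and 19).
```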

The covariance structure may be specified indirectly through random coefficients (i.e., random intercepts and/or random slopes; Laird & Ware, 1982). The advantage of this approach is that these models account for actual time values and so can be effective for unequally spaced times (e.g., measurements at purposely unequally spaced times like 0, 1, 3, 8, 12 months, or measurements at times that vary with subject).
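As an illustration of how random coefficients indirectly determine the covariance structure: a random intercept and random slope for time with covariance matrix G imply Cov(y_i) = Z_i G Z_i' + sigma_e^2 I for subject i, where each row of Z_i is [1, t]. A Python sketch (our own; the parameter values below are hypothetical):

```python
def random_coef_cov(times, var_int, var_slope, cov_int_slope, var_err):
    """Covariance of one subject's responses implied by a random intercept and
    random slope for time: entry (i, j) of Z G Z' + var_err * I, where a row
    of Z is [1, t] and G = [[var_int, cov_int_slope],
                            [cov_int_slope, var_slope]]."""
    return [[var_int + cov_int_slope * (ti + tj) + var_slope * ti * tj
             + (var_err if i == j else 0.0)
             for j, tj in enumerate(times)]
            for i, ti in enumerate(times)]

# Unlike CS, both the variances (diagonal) and the covariances depend on the
# actual time values, so unequally spaced times are handled naturally.
V = random_coef_cov([0, 1, 2], var_int=1.0, var_slope=0.25,
                    cov_int_slope=0.1, var_err=0.5)
```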

Maximum likelihood (ML) methods are typically used to estimate parameters corresponding to the fixed and random effects in LMMs. Standard ML generates biased estimates of covariance parameters (i.e., on average, the estimates do not equal the parameters they estimate), and so the default estimation approach in SAS and SPSS is restricted ML (REML), which generates unbiased estimates of covariance parameters. For large data sets, these two estimation alternatives will produce similar results.

Testing Fixed Effects in Linear Mixed Models

F-tests are used to test fixed effects of LMMs, but unlike for standard ANOVA, regression, and ANCOVA models, there is no commonly accepted way to do this. F-tests require two types of degrees of freedom (df), relating to the numerator (NDF) and the denominator (DDF) of the F statistic. There is agreement on how to compute the NDF parameter, but different approaches are available for determining the DDF parameter for these tests. Currently, SPSS uses a single undocumented approach (apparently the Satterthwaite approach). SAS, on the other hand, supports five different alternatives: the between-within, containment, Kenward-Roger, residual, and Satterthwaite approaches (for formulations, see SAS Institute, Inc., 2004). The default choice in SAS depends on the covariance structure: either between-within, if random effects are not specified, or containment, if random effects are specified. Although these different approaches are likely to generate the same conclusions in most cases, the conclusions could be affected by the choice for the DDF. An alternative is not to use any of these built-in F-tests and instead conduct likelihood ratio tests (LRTs) using the chi-square distribution to compare a model (e.g., a full factorial ANOVA model with group and time main effects plus a group-by-time interaction) to a nested model with one of its terms removed (e.g., the additive ANOVA model with the group-by-time interaction removed, leaving only the main effects of group and time). A significant result indicates evidence in support of the full model, while a nonsignificant result is interpreted as preference for the nested model under the principle of parsimony.
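The LRT computation itself is simple; here is a Python sketch (our own; the −2 log(L) values are hypothetical, and the critical values are the usual tabulated ones):

```python
# Chi-square critical values at alpha = .05 for selected df, as commonly
# tabulated (e.g., df = 1: 3.84, df = 5: 11.07, df = 6: 12.59).
CHI2_CRIT_05 = {1: 3.84, 5: 11.07, 6: 12.59}

def lrt(minus2ll_nested, minus2ll_full, df):
    """Likelihood ratio test of a full model against a nested model with df
    fewer parameters; returns the statistic and whether it is significant."""
    stat = minus2ll_nested - minus2ll_full
    return stat, stat > CHI2_CRIT_05[df]

# Hypothetical -2 log(L) values: removing a df = 6 interaction raises
# -2 log(L) from 100.00 to 115.05, so the full model is preferred.
stat, significant = lrt(115.05, 100.00, 6)
```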

Model Selection Criteria

Models can be compared, independent of tests for fixed effects, through penalized likelihood criteria (PLC) and cross-validation scores. Here, we give an overview of these model selection criteria, which have the added advantage of supporting comparisons of non-nested models.

Penalized Likelihood Criteria (PLCs)

PLCs are used commonly for model selection purposes (Koehler & Murphree, 1988; Sclove, 1987; Stone, 1979) and are generated in both SAS and SPSS for evaluating LMMs. Basically, likelihoods are combined with a penalty term, and scores are computed so that smaller values indicate better models. The best known PLC is the Akaike information criterion (AIC), with penalty based on the number of parameters. Schwarz's Bayesian information criterion (BIC; also called the Schwarz criterion) is another popular alternative PLC, with penalty based on both the number of parameters and the sample size. Currently, SAS uses the number of subjects as the sample size to compute BIC for longitudinal data; SPSS uses the number of measurements. (These are referred to as SAS BIC and SPSS BIC in the example analysis that follows.) These two alternatives have been generalized by Jones (2011) to an extended BIC score based on an effective sample size, with values ranging from the number of subjects to the number of measurements, but that approach is not considered here. Other PLCs are supported in SAS and SPSS, such as the corrected AIC (or AICC; Shi & Tsai, 1998), but are not considered in the example analysis.

When REML is used to generate parameter estimates, both SAS and SPSS compute AIC and BIC scores using only the number of covariance parameters, and do not include the number of fixed effect parameters. Consequently, AIC and BIC scores based on REML should be used only to compare models with different covariance structures and the same fixed effects model. On the other hand, with standard ML, both SAS and SPSS compute AIC and BIC scores using the combined number of covariance and fixed effects parameters, and so these scores can be used to compare different fixed effects models as well as different covariance structures.

Cross-Validation

In cross-validation, models are compared using scores computed separately for subsets of the available data values (e.g., pairs [x,y] in a bivariate regression) with parameter estimates based on the other data values. There are several alternative approaches for conducting cross-validation (Burman, 1989; Knafl, Fennie, Bova, Dieckhaus, & Williams, 2004). One approach is k-fold likelihood cross-validation (LCV; Knafl et al., 2004, 2010), where subjects are randomly partitioned into k distinct subsets, called “folds.” The likelihood for each fold is computed using parameters estimated from data for subjects in all other folds, and these fold likelihoods are combined into an LCV score, with larger values indicating better models. LCV for LMMs is not supported directly in SAS and SPSS. However, a SAS macro for computing k-fold LCV scores for LMMs is available online (http://www.unc.edu/∼gknafl/longtdnl.html, accessed 7/20/12).
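A schematic Python sketch of this procedure (our own; `fit` and `loglik` stand in for fitting the LMM to the training folds and evaluating the held-out log-likelihood, and we assume the score is normalized per measurement so models fit to different data sets are comparable):

```python
import math
import random

def assign_folds(subject_ids, k, seed=0):
    """Randomly partition subjects into k folds; all of a subject's
    measurements stay in that subject's fold."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    return {sid: i % k for i, sid in enumerate(ids)}

def lcv_score(data_by_subject, k, fit, loglik):
    """k-fold likelihood cross-validation: each fold's log-likelihood is
    evaluated at parameters estimated from the other folds, and the summed
    held-out log-likelihood is normalized per measurement (larger is better)."""
    folds = assign_folds(data_by_subject, k)
    total_ll = 0.0
    for fold in range(k):
        train = {s: d for s, d in data_by_subject.items() if folds[s] != fold}
        held_out = {s: d for s, d in data_by_subject.items() if folds[s] == fold}
        params = fit(train)                   # estimate from all other folds
        total_ll += loglik(held_out, params)  # evaluate held-out likelihood
    n = sum(len(d) for d in data_by_subject.values())
    return math.exp(total_ll / n)
```

The key detail is that the partition is over subjects, not measurements, so within-subject correlation never leaks between training and held-out data.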

Comparing LCV Scores

LCV scores can be compared using LCV ratio tests similar to LRTs (Knafl & Grey, 2007; Stone, 1977). To compare a model with a better (larger) LCV score to a simpler alternative model (e.g., with fewer fixed and/or random effects parameters) with a worse score, we used the percent decrease in the LCV score for the simpler model compared to the model with the better score. The threshold for a substantial (or distinct) percent decrease corresponds to the threshold for a significant chi-square test with the smallest possible change (df = 1) in the number of parameters (as reported by the SAS macro for computing LCV scores). (Formally, the percent decrease is a function of two times the log of the ratio of LCV scores, which is approximately chi-square distributed, and these can be related using Equation (6) of Knafl et al., 2010.) If the percent decrease exceeds its threshold, then the extra complexity of the model with the better score provides a distinct improvement. Alternatively, if the percent decrease does not exceed the threshold, the simpler model is a parsimonious, competitive alternative. In other words, its use imposes a penalty in reduced LCV score that is less than a significant reduction due to one less parameter.
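Assuming the chi-square statistic is two times the number of measurements n times the log of the LCV ratio (our reading of the relation above), the threshold can be computed as follows; for n = 771 this reproduces the 0.25% threshold used in the example analysis:

```python
import math

def pct_decrease_threshold(n_measurements, chi2_crit_df1=3.84):
    """Percent decrease in LCV corresponding to a significant df = 1
    chi-square test, assuming 2 * n * log(LCV ratio) is the test statistic."""
    return 100.0 * (1.0 - math.exp(-chi2_crit_df1 / (2.0 * n_measurements)))

def pct_decrease(lcv_better, lcv_simpler):
    """Percent decrease of the simpler model's score relative to the better one."""
    return 100.0 * (lcv_better - lcv_simpler) / lcv_better

# With 771 measurements, the threshold is about 0.249%, i.e., 0.25% rounded.
threshold = pct_decrease_threshold(771)
```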

The Smoking Data

We provide a step-by-step illustration of the concepts discussed above through analyses of a data set available on the Internet (http://www.biostat.harvard.edu/∼fitzmaur/ala/, accessed 7/20/12) consisting of 771 measurements of forced expiratory volume in 1 second (FEV1) for 133 adults in the Netherlands aged 36 or older over 7 times (0, 3, 6, 9, 12, 15, and 19 years, and so mildly unequally spaced) whose smoking status (current vs. former smoker) did not change during the study (a subset of the data from the study of van der Lende et al., 1981). Although not addressing all possible scenarios, this example guides the reader in how to approach possible steps of the decision-making process for selecting an appropriate LMM. Analyses were approved by the authors' Institutional Review Board as exempt.

The research question we addressed for these data is whether mean FEV1 changes differently over time for current compared to former smokers. Although the conventional approach for addressing this issue would be to fit a full factorial ANOVA fixed effects model (i.e., with factors for time, smoking status, and the smoking status by time interaction, yielding 14 means at 7 times for each of 2 groups), our example analysis considered this model as well as alternative choices and the consequential effects on conclusions.

Example Analysis

Choosing the Covariance Structure Subjectively

FEV1 measurements for the same subject can be as close as 3 years apart and as distant as 19 years apart, and so an AR structure (with correlations weakening as measurements become further apart) would seem appropriate for analyzing these data. Moreover, times were equally spaced 3 years apart, except for the last two time points, which were 4 years apart, and so there was no compelling subjective reason to use spatial AR instead. The example analysis used standard ML so that aspects of both the fixed and random effects in the LMMs could be assessed.

Using standard AR with constant variances together with the full factorial fixed effects ANOVA model, the interaction between smoking status and time was significant for all five DDF approaches (p = .019 for containment and residual, p = .020 for between-within and Satterthwaite, and p = .021 for Kenward-Roger). This full factorial fixed effects model (including the interaction term) can be compared to the additive fixed effects model (without the interaction term) using the LRT testing for a significant difference in −2 times the log likelihood (−2 log(L)) for these two models, with the df set equal to the difference in the number of parameters for the two models. In this case, the difference in −2 log(L) was 15.05. The associated df = 14 − 8 = 6 because the full factorial model had different means for the 14 combinations of seven times and two smoking status types, and the additive model had different means for the seven times plus one more parameter for a constant shift in mean for current versus former smokers; this is equivalent to the df = 6 test of the smoking status-by-time interaction. The associated threshold for a significant chi-square test with df = 6 is 12.59, so the LRT indicated that inclusion of an interaction provided a significant benefit, in agreement with the AR-based F-tests.

Rather than using an AR structure, the researchers might have decided to use CS because that is the conventional approach for modeling repeated response measurements. Under this covariance structure, combined with the full factorial fixed effects ANOVA model, the smoking status-by-time interaction was nonsignificant for all five DDF approaches (p = .154 for all approaches). The LRT comparing the full factorial fixed effects model to the additive fixed effects model had a difference in −2 log(L) of 9.34, less than the cutoff of 12.59 for df = 6, which indicated that inclusion of an interaction did not provide a significant benefit, in agreement with the CS-based F-tests.

Importantly, these conflicting results illustrated that the conclusion about the statistical significance of the interaction effect for the smoking data changed depending on the choice of covariance structure used to test that effect, no matter how the test was conducted. In terms of our research question, opposing conclusions would have been reached, depending on the covariance structure chosen for the analysis. Hence, it was crucial in this case to use objective model selection criteria to decide which covariance structure was appropriate and so determine which conclusion for the significance of the interaction effect was the proper one.

As described below, we used PLCs and LCV to compare alternative models. To compute LCV scores, a choice was needed for the number k of folds. Using the full factorial fixed effects ANOVA model with the AR covariance structure as a benchmark analysis, the first peak in LCV scores over multiples of 5 folds occurred at 15 folds, and so all reported LCV scores were computed with 15 folds (using the approach recommended in Knafl & Grey, 2007). The same result was achieved under the CS covariance structure. Subjects were randomly assigned to folds with all of their measurements allocated to the same fold. Sizes for the 15 folds ranged from 33 to 91 measurements, with an average of 51.4 out of the 771 total measurements. For this example, the threshold for a substantial percent decrease in LCV scores was 0.25%.

Choosing the Covariance Structure Objectively (Step 1)

Table 1 provides a comparison of a variety of basic structures for the within-subject covariances, using a full factorial fixed effects ANOVA model with factors of smoking status and time. For these data, AR and ARH both treat consecutive times as 1 unit apart, which may be reasonable for these data with times 3 years apart except at the end. However, because the last two times are 4 years apart, spatial AR, accounting for actual distances apart, may be better for these data. The three AR approaches generated a significant interaction effect, whereas CS, CSH, and UN identified it as nonsignificant (p = .020 for AR, p = .022 for ARH, p = .022 for spatial AR, p = .154 for CS, p = .146 for CSH, and p = .130 for UN, using the between-within DDF approach as also used in classical repeated measures analyses). Therefore, it was crucial to use model selection criteria that do not depend on such p-values to decide which of these models was most appropriate.

Table 1. Comparison of Alternative Basic Covariance Structures for Forced Expiratory Volume in 1 Second Over Time Using the Full Factorial Fixed Effects Analysis of Variance Model in Smoking Status and Timea

Table 1 contains AIC and BIC scores for the full factorial fixed effects ANOVA model with each of the above basic covariance structures. To see how these were computed under ML, note that the full factorial fixed effects model in group and time was based on 14 parameters. The CS covariance structure was based on two parameters, a constant variance and a constant correlation. Thus, the model had 16 parameters. AIC was computed as −2 log(L) + 2 × 16, that is, with penalty factor 2 × (number of model parameters), giving 264.00 + 32 = 296.00. SAS BIC was computed as −2 log(L) + log(133) × 16, that is, with the multiple 2 in the AIC penalty factor replaced by the natural log of the number of subjects (133), giving 264.00 + 78.25 = 342.25 (differing from the value 342.24 reported in Table 1 due to rounding). SPSS BIC was computed as −2 log(L) + log(771) × 16, that is, with the multiple 2 in the AIC penalty factor replaced by the natural log of the number of measurements (771), giving 264.00 + 106.36 = 370.36.
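These computations can be checked directly (natural logs throughout; a Python sketch of the penalty arithmetic):

```python
import math

def aic(minus2ll, n_params):
    # Penalty: 2 * (number of model parameters)
    return minus2ll + 2 * n_params

def bic(minus2ll, n_params, sample_size):
    # Penalty: log(sample size) * (number of model parameters); SAS uses the
    # number of subjects as the sample size, SPSS the number of measurements.
    return minus2ll + math.log(sample_size) * n_params

# CS with the full factorial fixed effects model: 16 parameters,
# -2 log(L) = 264.00.
print(aic(264.00, 16))                  # 296.0
print(round(bic(264.00, 16, 133), 2))   # SAS BIC: 342.25
print(round(bic(264.00, 16, 771), 2))   # SPSS BIC: 370.36
```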

Table 1 also reports LCV scores for the same models as for AIC/BIC. All four approaches agreed that CS was the best-fitting choice for these data. Thus, the interaction effect was more appropriately considered to be statistically nonsignificant than as significant.

Although AIC and the two types of BIC scores agreed on the overall best model of Table 1, they did not agree on all comparisons. For example, AIC considered UN the second best model of Table 1, and both BICs considered CSH second best. The two BICs disagreed on the comparison of the UN and AR models. Consequently, an objective approach is needed in general for deciding between different alternatives selected by different PLCs; an LRT or LCV can be used for these purposes. Our preference is to use LCV because it takes into account a model's performance on random subsets of the data, so that models capitalizing on chance variation in the data are likely to have low LCV scores. PLCs and LRTs, on the other hand, are based on estimates computed for the complete data and so are more likely to be influenced by chance variation effects when they occur.

Selection of a Fixed Effects Model (Step 2)

After applying model selection criteria to determine the structure of the covariances (i.e., CS for these data), the next step we considered was to determine the form of the fixed effects. Table 2 contains results for alternative fixed effects models under the CS covariance structure, including both ANOVA and regression models. Some researchers may prefer to limit this step to only ANOVA models or to only regression models, but both were considered here (and in Step 4) because that is more general.

Table 2. Comparison of Alternative Fixed Effects Models in Smoking Status and Time for Forced Expiratory Volume in 1 Second Over Time Using the Compound Symmetry Covariance Structurea

AIC and LCV selected the additive ANOVA model in time (as categorical with seven different values) and smoking status. The two BICs, on the other hand, selected the regression model in time (as continuous and linear with one slope) and smoking status. However, the percent decrease in LCV scores for the simpler regression model was only 0.13% (compared to the threshold of 0.25%), indicating that the regression model was a parsimonious, competitive alternative to the additive ANOVA model with more parameters. The LRT comparing these two models had a difference in −2 log(L) of 12.42. The associated df = 8 − 3 = 5, because the additive model had different means for the seven time points plus one more parameter for a constant shift in mean for current versus former smokers, and the regression model had an intercept plus slopes for time and smoking status. The associated threshold for a significant chi-square test with df = 5 is 11.07, so the LRT indicated that the additional parameters provided a significant benefit.

Although we would hope that both approaches would lead to the same conclusion, in this case, the LRT chose a more complex model, and the LCV ratio test a more parsimonious one. If the model selection process stopped here, we would have used the more parsimonious regression model. However, because the more complex model had the better LCV score, we used this fixed effects model in the subsequent covariance structure selection analysis.

Consideration of Random Coefficients (Step 3)

The CS covariance structure selected in Table 1 is equivalent to a model based on a random intercept (one for each subject) combined with independent errors having constant variance. More complex random coefficient models are possible and might provide improvements. When the model has two or more random coefficients, they may be treated as either independent or dependent. To model such dependence, we used the UN covariance structure. The example analysis only considered random effects for time through constant, linear, or quadratic coefficients, either independent of each other or having UN covariance. Table 3 provides a comparison of these random coefficient models using the fixed effects model based on an additive ANOVA with the best LCV score in Table 2.

Table 3. Comparison of Alternative Random Coefficients Modelsa for Forced Expiratory Volume in 1 Second Over Time Using the Additive Fixed Effects Analysis of Variance Model in Smoking Status and Timeb

AIC and LCV both identified the model with dependent random intercept and slope for time, but the two BICs selected the model with independent random intercept and slope for time. The percent decrease in LCV scores for the simpler independent random coefficients model was 0.03%, indicating that the independent random coefficients model was a parsimonious, competitive alternative to the more complex dependent random coefficients model. The LRT comparing these two models had a difference in −2 log(L) of 3.98. The associated df = 1 because the dependent random coefficients model had one extra covariance parameter. The associated threshold for a significant chi-square test with df = 1 is 3.84, indicating that the more complex model provided a significant benefit. Once again, we experienced conflicting results, as the LRT chose a more complex model, and the LCV ratio test chose a more parsimonious model.

The models of Table 3 were all based on independent errors having constant variance, but it is also possible to combine random coefficients with dependent errors. Table 4 contains results for a selection of such models. With this adjustment, AIC selected the model based on a random intercept combined with ARH errors, and the two BICs and LCV selected the simpler model based on a random intercept combined with AR errors. In any case, consideration of dependent errors allowed the random coefficient model to be simplified. The LRT comparing these two models had a difference in −2 log(L) of 16.22. The associated df = 6, because ARH had seven variance parameters compared to one for AR. The associated threshold for a significant chi-square test with df = 6 is 12.59, and so the LRT indicated that the more complex model provided a significant benefit. Once again, the LRT chose a more complex model, and LCV a more parsimonious model. (An LCV ratio test was not necessary for this conclusion because the more parsimonious model had the larger LCV score.) When there are such disagreements, we prefer to use the model selected by LCV, and that model was used in subsequent analyses.

Table 4. Comparison of Alternative Combined Random Coefficients and Error Covariance Models for Forced Expiratory Volume in 1 Second Over Time Using the Additive Fixed Effects Analysis of Variance Model in Smoking Status and Timea

As LMMs grow in complexity, the chance for non-convergence in the estimation process increases. For example, convergence problems occurred for four of the cases considered in Table 4. When convergence occurs for the full data, that solution can be used as a starting point for computing estimates for subsets of the data as part of computing the LCV score, reducing the chance of non-convergence for those subsets. Non-convergence either for the full data or for some subset is an indication that the model is inappropriate to consider.

Reassessing the Fixed Effects Model (Step 4)

The covariance model was refined in Tables 3 and 4 under the additive fixed effects ANOVA model (as selected in Table 2 using the CS covariance structure as selected in Table 1) to one based on a random intercept combined with AR errors. It is possible that an alternative fixed effects model may be more appropriate under the revised covariance structure selected in Table 4, and so the fixed effects model should be re-evaluated. Table 5 contains results addressing this issue. As in Table 2, AIC and LCV selected the additive ANOVA model, and both BICs selected the simpler regression model in smoking status and time. However, the simpler model now generated a substantial percent decrease of 0.38% in LCV. After adjusting the covariance structure, the regression model, although parsimonious, was no longer competitive. The LRT comparing these two models had a difference in −2 log(L) of 15.83. As before, df = 8 − 3 = 5 with threshold 11.07, and the more complex model provided a significant benefit. In this case, both the LRT and the LCV ratio tests identified the more complex model as appropriate. This was the final selected model.

Table 5. Comparison of Alternative Fixed Effects Models in Smoking Status and Time for Forced Expiratory Volume in 1 Second Over Time Using Covariance Structure Based on a Random Intercept Plus Autoregressive Errorsa

The Final Model (Selected in Steps 1–4)

We considered a variety of covariance structures in this example analysis, including directly specified CS, CSH, AR, ARH, spatial AR, and UN structures, as well as indirectly specified structures based on constant, linear, or quadratic random coefficients in time, either independent or UN dependent, and with errors either independent having constant variance or dependent under the directly specified structures. We also considered a variety of fixed effects models of ANOVA and regression types. The final selected LMM had additive ANOVA fixed effects in smoking status and time, together with a random intercept and AR errors. This covariance structure yielded constant variances with estimated value 0.32 (and so an estimated constant standard deviation of 0.57) and estimated correlations decreasing as response measurements became farther apart, from .88 for measurements 3 years apart to .85 for measurements 19 years apart. Mean FEV1 decreased for former smokers over time, from an estimated value of 3.52 (units not reported with the data) at baseline to 2.84 at 19 years later, with means for current smokers estimated to be a constant 0.32 lower at each time.

Discussion

The example analysis demonstrated that subjectively choosing the covariance structure for a LMM can result in mistaken conclusions about the significance of fixed effects for that model, leading to the incorrect conclusion for a research question. The interaction effect in a full factorial ANOVA fixed effects model was either significant at p = .02 or nonsignificant at p = .15, depending on the choice of the covariance structure. Such distinct results would have serious implications for randomized clinical trials, for which significant interactions may be needed to demonstrate the effectiveness of their interventions. Objective model selection based on criteria like AIC, BIC, and LCV is needed to avoid mistaken conclusions, and we advocate that it should be accepted practice. Also, once the primary hypotheses of a longitudinal study are tested using an objectively chosen covariance structure, model selection of the fixed effects can be conducted to obtain a parsimonious depiction of how mean responses change with available predictors like time and treatment group.

The example analysis considered a wide variety of fixed effects models and covariance types, but other issues could be considered in a more complete analysis. The need to include subject characteristics (like age and gender) also can be assessed with model selection criteria. The effect of missing responses can be addressed using an indicator for whether or not a subject had some missing responses or the number of missing responses (although this can be biased in comparison to multiple imputation; Knol et al., 2010). These can be included as covariates in the fixed effects model. Covariance parameters could be estimated separately for subjects with varying amounts of missing responses (this currently is supported only in SAS, not in SPSS). An assessment of skewness is also needed with appropriate adjustments (Azuero et al., 2010).

The example analysis only addressed continuous responses, but a similar approach could be conducted for categorical responses. It only addressed correlation over time, but correlation alternatively can occur within clusters (e.g., members of the same family; patients of the same provider) or within clusters as well as over time (e.g., longitudinal family data as addressed by Chang & Kelly, 2011 and Knafl, Knafl, & McCorkle, 2005). Further work is needed in these areas.

For the example analysis, when there was disagreement among model selection procedures, AIC and LRTs tended to identify more complex models, and both types of BIC tended to identify simpler models. Although this is unlikely to hold in all cases, it suggests a bias for AIC and LRTs in favor of more complex models and for BIC in favor of simpler models. LCV as adjusted with LCV ratio tests, on the other hand, sometimes favors simpler models and sometimes more complex models. It is important, though, not always to choose the model with the best LCV score, but to use LCV ratio tests to identify parsimonious, competitive alternatives (and so “good enough” models; Cheng, Edwards, Maldonado-Molina, Komro, & Muller, 2010). Moreover, because LCV scores are computed using estimates for subsets of the complete data, models highly influenced by chance variation in the data (and so affecting only a small number of those subsets) are likely to have lower LCV scores than models that discount that chance variation. AIC and BIC scores and LRTs, on the other hand, are likely to be influenced by such chance variation because they are based on estimates for the complete data. Consequently, we prefer to base model selection decisions on LCV. However, LCV currently is not supported directly in statistical software like SAS and SPSS, although AIC and BIC are. LCV can be conducted for LMMs using the SAS macro available from the corresponding author, as was used in the example analysis.

Summary

Beginning with a discussion of alternatives for the analysis of longitudinal data, we demonstrated that fitting models to these types of data is substantially more sophisticated than for their cross-sectional counterparts, which contain mutually independent observations. We also demonstrated a specific step-by-step modeling strategy to address these issues that starts with the full factorial ANOVA fixed effects model, finds the best directly specified covariance structure for this fixed effects model, and then adjusts the fixed effects model under the best directly specified covariance structure. Next, the adjusted fixed effects model is used to identify the best random effects covariance structure, either independent or dependent, along with independent or dependent errors. Finally, the fixed effects model under the best random-effects-adjusted covariance structure is adjusted again (so model selection can alternate between adjusting the fixed effects and covariance structures of LMMs). Other model selection strategies are possible (e.g., Cheng et al., 2010), but as long as sufficiently complete sets of alternatives are considered for fixed effects and covariance structures, they are likely to generate similar conclusions if not exactly the same models. Whatever the model selection strategy, it needs to be reasonably complete, addressing the various alternative models for means, variances, and correlations of longitudinal outcomes over time and other available predictors such as treatment group, and needs to be based on model selection criteria that are independent of tests for zero model coefficients.