No, Mplus does not impute values for those that are missing. It uses all data that is available to estimate the model using full information maximum likelihood. Each parameter is estimated directly without first filling in missing data values for each individual.
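To make the "no imputation" point concrete, here is a minimal sketch of a casewise (full information) log-likelihood for two continuous variables. It assumes bivariate normality, and the function name and the use of `None` for a missing value are illustrative choices, not Mplus internals. Each case contributes the density of whatever it has observed, so nothing is filled in.

```python
import math

def casewise_loglik(rows, mu, sigma):
    """Observed-data log-likelihood under a bivariate normal (illustrative).

    rows  : list of (y1, y2) tuples; None marks a missing value
    mu    : (mu1, mu2) means
    sigma : ((s11, s12), (s12, s22)) covariance matrix
    """
    (s11, s12), (_, s22) = sigma
    det = s11 * s22 - s12 ** 2
    total = 0.0
    for y1, y2 in rows:
        if y1 is not None and y2 is not None:
            # Both observed: bivariate normal log density
            d1, d2 = y1 - mu[0], y2 - mu[1]
            quad = (s22 * d1 * d1 - 2 * s12 * d1 * d2 + s11 * d2 * d2) / det
            total += -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * quad
        elif y1 is not None:
            # Only y1 observed: univariate normal log density for y1
            total += -0.5 * (math.log(2 * math.pi * s11) + (y1 - mu[0]) ** 2 / s11)
        elif y2 is not None:
            # Only y2 observed: univariate normal log density for y2
            total += -0.5 * (math.log(2 * math.pi * s22) + (y2 - mu[1]) ** 2 / s22)
        # Cases missing on both variables contribute nothing
    return total
```

Maximizing this function over `mu` and `sigma` (or over the parameters of a model that implies them) is the idea behind estimating directly from all available data.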

I am having difficulty getting Mplus to converge on H1 (and thus to get a chi-sq test) for a missing-data-latent-growth-curve model, even when I fiddle with the starting values and convergence criteria. It runs fine when I do not ask for "type= missing h1;" but then I can't get the chi-sq. Am I missing some fundamental piece of the puzzle?

If you give the Mplus statement type=missing h1, the program first does H1 and then H0. You may want to first do a type=basic missing. The H1 estimation that this leads to can be difficult if there are large percentages of missing data - see the Covariance Coverage output. Starting values are not needed for H1. You can try to sharpen the convergence criterion as described in the User's Guide.
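For readers who want to inspect coverage before running Mplus, the following is a small sketch that computes, for each pair of variables, the proportion of cases with both values present. This is my reading of what the Covariance Coverage output reports; the function name and the `None` missing-value convention are assumptions made for illustration.

```python
def covariance_coverage(rows):
    """Proportion of cases with both variables observed, for each pair.

    rows : list of equal-length tuples; None marks a missing value.
    Returns a dict mapping (i, j) with i >= j to a proportion; the (i, i)
    entries are the proportions of cases present on variable i alone.
    """
    n = len(rows)
    p = len(rows[0])
    coverage = {}
    for i in range(p):
        for j in range(i + 1):
            both = sum(1 for r in rows if r[i] is not None and r[j] is not None)
            coverage[(i, j)] = both / n
    return coverage

# Hypothetical data: four cases, three variables
data = [(1.0, 2.0, None), (1.5, None, 3.0), (0.5, 2.5, 3.5), (None, 1.0, 2.0)]
coverage = covariance_coverage(data)
lowest = min(coverage.values())  # the pair that carries the least information
```

Low values for some pairs mean the corresponding H1 covariances are estimated from few cases, which is one reason the H1 step can struggle.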

I have data that do not fit the assumptions Mplus imposes for SEM with missing data so I am using a multivariate, multiple imputation approach such as that advocated by Little and Rubin.

My question is whether the coefficients and standard errors generated by the Mplus WLSMV estimator present particular problems for those planning on combining results from several separate imputed data sets.

In the manual, you point out that "Mplus has two special data handling features when data are missing because of the design of the study."

I understand that using the "not by design" missing data features, models assume that data are missing at random or missing completely at random. I have a data problem that doesn't technically seem to fit either scenario.

The study is looking at drug/alcohol treatment over time. It follows 2 cohorts of over 1,300 adults at baseline, 6 months, 18 months, 24 months, and 36 months. Because of funding constraints only 1 cohort (n about 700) was interviewed at 18 months. Both cohorts were interviewed at each of the other waves. I am wondering whether or not we should simply drop the entire 18-month wave data in a growth curve model or if we can somehow include the existing data from the 1 cohort who was interviewed. Technically, the missing cohort at 18 months was not missing at random and it did not seem to be similar to the missing patterns by design examples either. In addition, because this is a longitudinal study, there are data at waves other than 18 months that are missing, but these are more defensibly considered “missing at random.”

I would not get rid of the data for 18 months. Measuring one cohort only at 18 months constitutes missing by design and is MCAR. You also have attrition which may be MAR but I couldn't comment on that. I would analyze all of the data using TYPE=MISSING. This is if your outcomes are continuous. TYPE=MISSING is not available for categorical outcomes.

I guess it is MCAR (i.e., a random event caused it). I wasn't thinking of it as such since I was wondering if any existing differences between the 2 cohorts would pose a problem in the missing data estimation. But it was not cohort differences that "caused" the missing data, just the flip of the cohort coin. Your advice is very helpful. Thanks!

My question is whether these 3 features can be used in conjunction. If not, I'm wondering if it would make sense to do multiple imputation on the missing data, and then use the complex sampling & robust estimators in conjunction?

We are having an update in about two weeks that will include crossing TYPE=COMPLEX with MISSING AND MIXTURE, but weights will not be allowed at this time. You can do a one class mixture and thereby cross complex without weights and missing. Perhaps that might help. The estimator is MLR. MLR has maximum likelihood estimates and robust standard errors. This may help. Otherwise, multiple imputation would be the way to go.

Work is being done in this area but nothing definitive is available at this time. A possibility is to use the pattern mixture approach (see Little's 1995 JASA paper), using covariates, a multiple group approach, or a mixture approach.

I am trying to estimate the Diggle and Kenward (1994) model in order to account for non-ignorable missing data. This would, however, require estimating a multiple-groups model with heterogeneous structures (as it is called in AMOS) using MPLUS. Thus, different groups should be allowed to have different variables included in the analyses (e.g., in a longitudinal setting with one outcome variable measured up to six times, for group1 outcome1 would be modeled, for group2 outcome1 and outcome2 would be modeled, etc.). I did not find such an option in MPLUS, so my question is: Is it possible to estimate a multiple groups model with heterogeneous structures in MPLUS (which would also be helpful for "pattern mixture models")?

In the Diggle and Kenward approach, Mplus would need to model the growth part among the "y's" and the missingness as a function of previous observed y's among the "u's", where the quotation marks are used to refer to the general model parts of the Mplus framework. The D & K approach therefore needs to be able to allow missingness on y's as well as "u ON y" (logit) regressions. This combination cannot be done in the current Mplus, but is planned for version 3 due out early 2003.

The pattern mixture approach, however, would seem to be possible to carry out in the current Mplus. This would not use the regular multiple-group track because that requires non-zero variance for equal numbers of observed variables in the groups, which is not present here due to missingness. Instead the mixture track (type=mixture) would be used. In the mixture track, there is no problem due to missing data and zero variance for y's. The groups corresponding to the dropout patterns can be represented by latent classes ("c"), where the known class membership is handled using the "training data" feature. The growth model parameters for the y's can then be allowed to vary across classes (groups, patterns) to the extent desired. We can help with trying this approach.

Theory supports the fact that the corrected standard errors (sandwich or White) for missing data are correct under MCAR with non-normal data. For normal data, they are correct under MAR. We have found that these corrected standard errors also work better than regular standard errors under MAR and non-normality. However, there appears to be no theory to support this (see Yuan and Bentler in Soc Methods, 2000).

Thank you for sending the outputs. The correct model in Mplus is the one using the WITH statements. The reason the answers did not agree is that this model did not converge. I added two starting values for the variables GRADRAT and CSAT, which have large variances, and the model converged to the same solution as AMOS.

What changes do I need to make in this input file when y10 is missing by design in groupB (e.g., Type = mixture missing)? Also, are there any fit indices unavailable after respecifying this missing data problem as a mixture analysis?

You can split groupB into two groups, one group for the observations with y10 present and the other with y10 missing. Add f1 by y10@0 for the last group and use type = mgroup meanstructure. Type = mixture missing is not going to give you what you want.

The idea of the solution proposed above is good - that y10 should not influence the fitting function in the last group where it is missing - but there seem to be two complications. Mplus will complain that y10 has zero variance in the last group. This can be circumvented by letting one person have a different value for y10 in the missing data group to give a quasi-nonzero y10 variance. Also, I think the weight matrix will be singular with zero variance and I don't know its quality if a quasi-nonzero variance is introduced for y10. I don't know if some other trick can be used. Categorical missing facilities are forthcoming in future Mplus versions.

Hello. I am working on an LCA model with some missing data, and I would appreciate some advice on its behavior. The (binary) latent class indicators include 5 behaviors measured at each of 2 posttreatment follow-ups (for a total of 10 indicators). About 40% of the sample were interviewed at the short-term follow-up, but not the long-term assessment. Some noninterviews were by design, and some represent attrition. I also have several pretreatment covariates I am using to predict the classes. There are several points I am wrestling with:

1. The "Test for MCAR..." is clearly nonsignificant. What practical effect should this have on modeling strategy?

2. Group membership changes noticeably when I add predictors. I suspect many "changers" are individuals with only 1 interview, because they simply have less "u" information available for classification, increasing the importance of their "x" information. If this is true, is it reasonable to believe that the LCA with covariates is more likely to be the "correct" model? Should decisions about the number of classes be made in the presence of the covariates?

3. To test for possible nonignorable missingness, is it appropriate in the context of LCA to try a "pattern mixture" approach (in the spirit of Little or Hedeker & Gibbons)? That is, adding a "missing interview" indicator and interaction terms to the set of covariates.

Re 1, your MCAR test would seem to suggest that you can feel more comfortable using the ML approach that you are using. The ML approach is correct under the less strict MAR assumption. So, having support for MCAR is comforting, but I don't see that it changes your modeling strategy.

Re 2, changing group membership may point to a misspecification. If in the true model the predictors influence only class membership and not the latent class indicators directly, then you should get statistically the same membership with and without predictors in the model. But if the true model has some direct effects of predictors on latent class indicators, then class membership will change when including predictors but not allowing for the direct effects. The solution is to examine the need for direct effects by including one at a time and looking at chi-square differences (2*logL differences). It is also correct that predictors help to better determine class membership when the latent class indicator information is not strong, but with a correctly specified model this additional information should not cause essential changes in membership.

Re 3, yes, a pattern mixture approach could be useful here.
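The chi-square difference (2*logL difference) check mentioned under Re 2 can be done by hand from the log-likelihoods of two nested runs. Below is a minimal sketch using only the standard library; the function names and the example log-likelihood values are hypothetical. It assumes plain ML log-likelihoods (robust estimators such as MLR need a scaling correction that is not shown here).

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Starting values: df = 2 has tail exp(-x/2); df = 1 has tail erfc(sqrt(x/2))
    if df % 2 == 0:
        q, k = math.exp(-half), 2
    else:
        q, k = math.erfc(math.sqrt(half)), 1
    # Recurrence: Q(k + 2) = Q(k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1)
    while k < df:
        q += math.exp((k / 2.0) * math.log(half) - half - math.lgamma(k / 2.0 + 1))
        k += 2
    return q

def lr_test(loglik_restricted, loglik_full, df):
    """Likelihood-ratio statistic and p-value for two nested models."""
    stat = 2.0 * (loglik_full - loglik_restricted)
    return stat, chi2_sf(stat, df)

# Hypothetical values: model without vs. with one direct effect (1 extra parameter)
stat, p_value = lr_test(loglik_restricted=-1250.0, loglik_full=-1247.0, df=1)
```

A small p-value would suggest that the added direct effect is needed; repeating this for one direct effect at a time follows the strategy described above.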

If I am trying to run a Discrete-Time Survival Analysis, but I have missing data in my X values, is the only way for me to estimate a model with missing data to use a program such as NORM and impute the missingness?

Yes, unless the x variables are such that they do not influence class membership, in which case they can be turned into "y variables" (for which missingness is handled) by referring to a parameter for x (e.g. its mean).

I need help running an EFA with missing data. Missingness is due to use of a 3-form design for 180 participants; data from 49 additional participants who completed either the first or second half of the 64-item set are also included. Covariance coverages range from .262 to .633. I used the following code--my first MPLUS experience.

I have tried lowering the coverage criterion to .08, running the model with up to 16 of the 64 variables of interest deleted, eliminating the H1ITERATIONS & H1CONVERGENCE statements, and using analysis type missing basic. The messages I get go something like this...

THE MISSING DATA EM ALGORITHM FOR THE H1 MODEL HAS NOT CONVERGED WITH RESPECT TO THE LOGLIKELIHOOD FUNCTION. THIS COULD BE DUE TO LOW COVARIANCE COVERAGE OR A NOT SUFFICIENTLY STRICT EM PARAMETER CONVERGENCE CRITERION. CHECK THE COVARIANCE COVERAGE, OR SHARPEN THE EM PARAMETER CONVERGENCE CRITERION, OR RERUN WITHOUT H1 TO OBTAIN H0 PARAMETER ESTIMATES AND STANDARD ERRORS. NOTE THAT THE NUMBER OF H1 PARAMETERS (MEANS, VARIANCES, AND COVARIANCES) IS GREATER THAN THE NUMBER OF OBSERVATIONS. NUMBER OF H1 PARAMETERS : 2144 NUMBER OF OBSERVATIONS : 229
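The parameter count in this message can be reproduced by hand: an unrestricted H1 model for p variables has p means plus p(p+1)/2 variances and covariances. The short sketch below (function names are mine, for illustration) confirms the figure for the 64 items and shows how many variables would bring the count down to the sample size.

```python
def h1_parameter_count(p):
    """Means plus variances and covariances of an unrestricted model."""
    means = p
    variances_and_covariances = p * (p + 1) // 2
    return means + variances_and_covariances

def largest_p_within(n_obs):
    """Largest number of variables whose H1 parameter count is <= n_obs."""
    p = 0
    while h1_parameter_count(p + 1) <= n_obs:
        p += 1
    return p

count_64 = h1_parameter_count(64)   # 64 + 2080 = 2144, as in the message
count_48 = h1_parameter_count(48)   # after deleting 16 of the 64 variables
max_vars = largest_p_within(229)    # variables that fit within 229 observations
```

Even with 16 variables deleted the count stays far above 229, which is consistent with the H1 step remaining difficult in those runs.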

I think that the covariance coverage is adequate--how do I go about changing the convergence criterion or running the model without H1?

In response to your question, yes the missingness is by design; 180 participants completed 3-form design questionnaires containing 2 of the three subsets of items. I have additional data from 49 participants, each of whom completed half of the 64 items of interest. What options does this give me?

Here is an answer about what one can do in principle with missing by design - without claiming that this is how you should try to do your analysis. If I understand your design correctly, apart from the 49 subjects, there are 3 groups of subjects, each of which has missingness on parts of the variables. In a CFA, these 3 groups could be handled via multiple-group modeling where in each group only the reduced set of variables actually observed in the group would be considered, so that each group would only have missingness that is not by design. This would be an analysis with only about 2/3 of the variables and therefore perhaps less heavy. This approach has 2 complications for you. One, it is not clear how to handle the 49 participants since each group needs to have the same number of variables. Two, you want to do an EFA.

Regarding your analysis, what is the lowest coverage value that gets printed?

I am thinking about using multiple imputation with data on which I am doing a structural equation model. The outcome variable in this model is dichotomous, which limits my options for handling missing data. I am considering using multiple imputation, and am wondering how to approach doing this in Mplus. I can create my multiple data sets in other software packages. I have read on your website about the RUNALL facility. Would it make sense to run the analyses with that? Also, are there any other features in Mplus that might be useful in this, including anything that would combine the results from the multiple runs to give final estimates? (Or is that step something I need to do by hand?) Thanks.
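If the combining step does have to be done by hand, the usual approach is Rubin's rules: average the estimates, and add the between-imputation variance (inflated by 1 + 1/m) to the average squared standard error. The sketch below assumes you have collected one parameter's estimate and standard error from each of the m runs; the function name and the numbers are hypothetical. It pools point estimates and standard errors only, not chi-square or other fit statistics.

```python
import math
from statistics import mean, variance

def pool_rubin(estimates, std_errors):
    """Pool one parameter across m imputed-data analyses (Rubin's rules)."""
    m = len(estimates)
    if m < 2 or m != len(std_errors):
        raise ValueError("need matching lists from at least two imputations")
    pooled = mean(estimates)
    within = mean([se ** 2 for se in std_errors])   # average sampling variance
    between = variance(estimates)                   # variance across imputations
    total = within + (1.0 + 1.0 / m) * between
    return pooled, math.sqrt(total)

# Hypothetical results for one path coefficient from m = 5 imputed data sets
est = [0.50, 0.54, 0.46, 0.52, 0.48]
ses = [0.10, 0.10, 0.10, 0.10, 0.10]
pooled_est, pooled_se = pool_rubin(est, ses)
```

The pooled standard error is larger than any single-run standard error whenever the estimates differ across imputations, which is how the uncertainty due to missing data enters.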

I would like to use the MLR option across multiple imputations. Because I have no missing data I am not specifying type=missing. When I try to run the model I get a message telling me that the MLR estimator is not available with type=general. Is MLR only available if you have missing data?

My sense is that Mplus can only account for data missing on Y variables.

Is this because the computation is too intensive to include imputation on X's, or because it's empirically incorrect to impute on Y's and X's at the same time?

I ask because I've noticed that one of the HLM packages allows multiple imputations on X and Y in the same model run. This would appear to imply that such models borrow information from the X's to impute Y's (and vice-versa).

Will Mplus allow for imputation on X's and Y's in the near future (version 3.0)?

Modeling typically concerns a specification of the distribution of y | x (y conditional on x), whereas the marginal distribution of x is not involved in the model. When there is missing on x, a model for the marginal x part needs to be added. This is true for imputations as well as other modeling. That's why missingness on x's changes the picture and is not trivial - it calls for an extended model that may be hard to specify realistically.

I am not clear on what type of x modeling HLM does for imputations in the x part - I am not sure that this is stated; please let me know if I am wrong. Mplus does not do imputations, but handles missing data in a general way using ML under MAR. Mplus can handle missing on x's if they are brought into the model as "y's". This is done automatically in some tracks of the program (such as non-mixture, non-categorical). In other tracks, x's can be moved into the y set by mentioning parameters related to them in the model. Missing on x is then handled by a normality model for the x's. Normality may not be suitable if x's are, say, binary and skewed. In Schafer's imputation programs, missingness on categorical x's is handled by loglinear modeling. Mplus Version 3 will have more facilities related to missingness on categorical variables and missingness for variables that have random slopes.

Both 1 and 2 can be estimated in the current version of Mplus. These techniques have been available since Version 2.1 which came out in May 2002. The use of these techniques is described in the Addendum to the Mplus User's Guide which can be found at www.statmodel.com under Product Support. More features are coming in Version 3 in the Fall.

The Mplus techniques for multilevel SEM with missing data are described in a paper that we will be happy to make available at the end of the Summer. We are not aware of any other references on this topic.

To calculate the significance of the R sq change (0.062 - 0.011 = 0.051) can I simply calculate the change in Chi sq (17.652 - 3.080 = 14.572), the change in DF (5 - 1 = 4) and conclude that the R sq change is significant @ p<.01? (The critical value of Chi sq for 4 df and p < .01 is 13.28). Or am I on the wrong track completely?? For your advice please,

A chi-square difference test cannot be used to determine whether an R-square difference is significant. It can be used to see if parameters in nested models are significant. For example, you could compare

2. Can I compare nested models (e.g., restricting covariances to be equal across groups) using a chi-square difference test when using the missing command?

3. When I run a mgroup analysis (not specifying missing) leaving the 'estimator=' blank, Mplus uses the ML estimator. Can I always trust what Mplus picks? For example, I have some categorical ivs and some categorical indicators of a latent variable.

*** WARNING Data set contains cases with missing on all variables. These cases were not included in the analysis. Number of cases with missing on all variables: 115
*** WARNING Data set contains cases with missing on x-variables. These cases were not included in the analysis. Number of cases with missing on x-variables: 318
*** WARNING Data set contains cases with missing on all variables except x-variables. These cases were not included in the analysis. Number of cases with missing on all variables except x-variables: 209

Regression on x's does not include a model for the x distribution, but the model concerns the y outcome conditional on the x's. To handle missing data on x's, you need to expand the model to include a model for the x's, e.g. assuming normality. This can be done in several ways. One way is to first do a multiple imputation step outside Mplus. Note that Mplus can take multiply imputed data as input. Another way is to include the x's in the model in Mplus - this is done by mentioning say their variances:

x2-x4;

You can use ML estimation by the Analysis option

estimator = ml;

in which case the missing data on your 3 x's results in 3 dimensions of numerical integration.

I am running a parallel process latent growth curve model in version 3.11 (3 equally spaced time points) involving two outcomes measured continuously (depression and smoking). The latent intercepts and slopes are regressed on two x's: gender and number of siblings. The data are nested (individuals nested within schools). There are missing values on both the Y's and on the sibling "X" variable.

The model appears to run fine (no warnings or error messages) and generates results that make sense. However, when I attempt to evaluate the plausibility of the model for girls and boys separately, I get a warning message that states "data set contains cases with missing on x-variables. These cases were not included in the analysis". Below is the syntax used for latent growth model using the multiple group procedure:

Am I making an error somewhere in the syntax? Does Mplus offer FIML for LGC models where there is missing data on both the Y's and the X's in the context of running complex models (nested data) and multiple group comparisons?

I don't see how you would not have missing on x's when you read the full data set and have missing on x's when you look at part of the data set. If you send the two outputs, the one that worked and the one that didn't, and the data, to support@statmodel.com, I can figure this out.

Regarding missing on x's, the following is from Chapter 1 in the Mplus User's Guide:

"In all models, missingness is not allowed for the observed covariates because they are not part of the model. The outcomes are modeled conditional on the covariates and the covariates have no distributional assumption. Covariate missingness can be modeled if the covariates are explicitly brought into the model and given a distributional assumption."

I have a question concerning missing data. I am constructing a twolevel model. Some of my within-level variables have missing values. Should I specify that type is twolevel missing or is it not necessary? I am using mlr as an estimator.

The unrestricted model is the model of all means, variances, and covariances of the observed variables being free. There are no restrictions on any of the parameters. It is the H1 model. The reason that it is not automatically estimated with TYPE=MISSING in all cases is that it can be slow and is needed only to compute chi-square. So Mplus has it as an option. Without it, you will get parameter estimates and standard errors but not chi-square.

I have a model where I am testing for invariance of structural paths across gender in the multiple-group context (all observed, continuous variables) but I am concerned that I have data that are NMAR. One of my endogenous variables is frequency/quantity of alcohol use and I have strong reason to believe that missingness on alcohol use is related to true levels of alcohol use. Consistent with suggestions made in earlier postings, I have constructed training data to represent 4 gender X missing data groups (i.e., males w/complete data, males w/incomplete data, etc. - missing data patterns are too sparse for additional patterns). In order to get weighted averages for structural coefficients (and intercepts) across missing data patterns (as you would via Hedeker/Gibbons 97), I have constrained all parameters to equality within gender for all models (i.e., male incompletes and male incompletes equated, female incompletes and female incompletes equated).

My base model would be the fully constrained model (all parameters equated across all 4 groups) - In order to test for invariance across gender, I allowed males to differ from females (but maintained equality constraints between the within-gender missing data groupings). I then used 2(deviance1-deviance2) for single df X^2 difference tests for invariance across gender. I wanted to know if a) this approach to pattern mixture modeling is generally defensible, b) could I compare the deviance from a fully saturated model against my base model so I can give an indication of "model fit" (and hand calculate RMSEA and the like) and c) if this is defensible, is there a citation to justify this approach specifically in SEM other than Hedeker/Gibbons 97 or Little 93 (i.e., Muthen/Brown 01 - is this manuscript available)?

A pattern-mixture approach of this kind seems generally ok, but I need some clarification. You seem to have a typo in the parenthesis at the end of the last paragraph and the second parenthesis of the second paragraph sounds strange to me. Seems like you want to test gender invariance while allowing for missing data differences within each gender. Also, does Hedeker's work give formulas for weighting regression coefficients across the missing data groups? I have not seen a reference to pattern-mixture for SEM. Muthen-Brown is still not available and is focused on actually letting latent variables predict missingness.

Yes, I am looking to test gender invariance and adjust for differences across the missing data groups.........Hedeker does give formulas for weighting regression coefficients across missing data groups. He does this by weighting the estimates by the observed proportions among the missing data groups in his 97 Psych Methods paper (an illustration of a conditional LGM with NMAR dropout in Proc Mixed). Here is the link to the .pdf http://tigger.uic.edu/%7Ehedeker/RRMPAT.pdf. Formulas 12 and 14 are the formulas for the weighted estimates and standard errors respectively. The corresponding dataset and SAS IML program that performs the matrix operation described therein is at http://tigger.uic.edu/~hedeker/ml.html. I simulated longitudinal data that were NMAR for 2 missing data patterns and analyzed the simulated data both in Proc Mixed/IML with his approach and in GGMM in Mplus with the two missing data groups identified w/training data and got nearly identical estimates. So the approach seemed viable but I did not want to move forward without consultation. I have had great difficulty finding a published analog to Hedeker's approach in SEM and had wondered whether Muthen-Brown was the SEM analog but also wanted your take on the approach before going forward........
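The weighting idea described here can be sketched as follows: average the pattern-specific estimates using the observed pattern proportions, and build a standard error by the delta method that also reflects sampling error in the proportions. This is my own illustrative version; I have not checked it term by term against formulas 12 and 14 in the Hedeker paper. It assumes the pattern-specific estimates are independent of each other and of the proportions, and all names and numbers are hypothetical.

```python
import math

def pooled_over_patterns(estimates, std_errors, group_sizes):
    """Proportion-weighted average of pattern-specific estimates.

    estimates   : one estimate per missing-data pattern
    std_errors  : matching standard errors
    group_sizes : number of cases in each pattern
    """
    total_n = sum(group_sizes)
    props = [k / total_n for k in group_sizes]
    avg = sum(p * b for p, b in zip(props, estimates))
    # Part 1: sampling variance of the estimates, proportions treated as fixed
    var_est = sum((p * s) ** 2 for p, s in zip(props, std_errors))
    # Part 2: multinomial sampling variance of the proportions (delta method)
    var_prop = (sum(p * b * b for p, b in zip(props, estimates)) - avg ** 2) / total_n
    return avg, math.sqrt(var_est + var_prop)

# Hypothetical slope estimates for completers (n = 60) and dropouts (n = 40)
avg_slope, avg_se = pooled_over_patterns([1.0, 2.0], [0.1, 0.2], [60, 40])
```

With two patterns the second variance part reduces to (b1 - b2)^2 * p * (1 - p) / N, so it vanishes when the patterns have the same estimate.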

I'm running a simple linear regression analysis in Mplus3 where I want to correct the standard errors for the design effect (two-level structure) as well as estimate this model with missing data.

The "problem" is that I have got missing data on both the dependent variable and several of the independent variables. In another posting on this page you wrote

"Regression on x's does not include a model for the x distribution, but the model concerns the y outcome conditional on the x's. To handle missing data on x's, you need to expand the model to include a model for the x's, e.g. assuming normality. This can be done in several ways. One way is to first do a multiple imputation step outside Mplus. Note that Mplus can take multiply imputed data as input. Another way is to include the x's in the model in Mplus - this is done by mentioning say their variances:

x2-x4;

This leaves me a bit confused on exactly how to write it in syntax. Could you please take a look at my syntax and see if this is the correct way to write a model where we have got missing data in a complex design (where there is also missing on the independent variables) as well as correcting the standard errors.

****** In this model we wish to see how six different family structures expressed as five dummy variables (singlem1 singlem2 stepf JPC singlef) predict antisos after we have controlled for age, gender, and education. I have got no missing data on the dummy variables, only on the control variables and the dependent variable.

Is this the correct way to do it? Do you think it's best to impute the missing data with multiple imputation (NORM) before you use Mplus or to include the x's in the model in Mplus - by mentioning their variances (as I think I have done here)? And related: Is it ok to impute missing data with NORM and use the imputed datasets in Mplus3 even when you have got a nested data set?

I think this is the correct approach. I don't think NORM handles clustered data. You may have to correlate the observed variables using the WITH option. I'm not sure of the default. You could, however, remove the dummy variables from the variance list given that they have no missing data.

I have longitudinal survey data at 5 time points. I am interested in using multiple imputation to handle missing data. I plan on using available data from time 1 for the imputation model used to impute values for the missing values at time 1. I would like to use the imputed (i.e., complete) data from time 1 to help impute the missing data in time 2, and so on.

I have two main questions about this:

Would you recommend doing a single imputation for each wave of data? Otherwise, I would have, say, m=5 imputed data sets for time 1, and then it is not clear how I would go about using time 1 to help impute the time 2 data.

Also, do you have recommendations about whether to use individual items vs. scale scores in the imputation model? I would like to have complete item-level data (for subsequent factor analyses), not just complete scale scores (for path analyses, for example). I have seen examples of multiple imputation and they all use scale scores in the model. Is it ever appropriate to use individual items in the imputation model? I can't seem to find anything about this in the literature.

In principle, a good approach would be to use item-level data for all 5 time points jointly, perhaps adding covariates, analyzing these variables by ML under the usual MAR assumption. This approach is certainly doable on the scale score (or IRT theta) level. But perhaps the approaches you discuss are motivated by this ML-MAR approach involving too many variables when working on the item level. Perhaps that is why you suggest a time-by-time approach. However, the use of complete (partly imputed) data from time 1 for imputing values for time 2 does not seem like a good approach to me since it is acting as if the imputed time 1 data are real. And a single imputation would not give the desired result of multiple imputation - showing the true variability. Staying with the idea of imputing item-level data for each time point separately, it seems feasible to do this using observed (not imputed) data on items (and covariates) at all other time points. I am not familiar with literature on these matters.

Thank you for your quick reply. When you say "use the item-level data for all 5 time points jointly, perhaps adding covariates," isn't it the case that the covariates would already be factored in due to having all the survey items in the imputation model already? So I'm not sure what you mean here. Are you saying that if I want to use depression info as part of the model to impute values for missing anxiety scale items, I should use the depression scale score instead of the depression scale items?

Also, just to clarify, you think it is appropriate to use data collected at subsequent time points to impute values from previous time points? (I'm not arguing against the view, just want to clarify). Would this still be the case if there is a reason to expect measurements to change over time (e.g., some of the participants belong to an anxiety treatment group)?

When I mentioned covariates I was making a distinction between background variables (e.g. demographics) and the (test?) items - it sounds like you are calling all of these variables "survey items" so we were probably just using different vocabulary. So my answer is no to your question at the end of your first paragraph. Regarding your second paragraph, yes my inclination would be to use any variable that might be correlated with the items with missing data.

I am using GGMM to analyse a longitudinal dataset with missing values. It seems that if "missing" is specified in the variable and analysis commands, the FIML method will be used and the default algorithm is EM, i.e. the observed-data log likelihood will be maximized. Am I right about this? In the output I got a warning saying that the Fisher information matrix and the standard error matrix related to some parameters cannot be inverted. What does this imply in general?

One more question: is MCMC ever considered in Mplus when dealing with latent variables and missing values?

I am using type = missing h1 (with the ML estimator) for a structural equation model using all continuous variables (latent and manifest). I am trying to provide a brief description of what missing h1 does for a manuscript. I read the manual but was confused. Could you provide a brief description for inclusion in the manuscript?

"Missing H1" says that we want to do ML estimation of an unconstrained (saturated) covariance matrix for the observed variables taking missingness into account under the MAR assumption (see the Little & Rubin book). This ML-MAR estimation is carried out using the EM algorithm in line with the L-R book. The estimated covariance matrix is used to compute a chi-square test of model fit, comparing H0 to H1.

Yes, it is. There is a table in Chapter 15 of the Mplus User's Guide that shows which TYPE options are available for various estimators and outcome scales. See ESTIMATOR in the index of the user's guide to find this table.

Dear Dr. Muthen, I am running a path analysis with 7 IVs at Time 1 predicting 2 DVs at Time 2. I have some missing data (not a huge amount, the coverage is around .9 for all variables). I have specified Type = missing h1 under the analysis command. I have the following questions: 1. Does this missing data command take into account ALL variables that are listed under NAMES ARE, or does it use only the variables that are listed in USE VARIABLES ARE? 2. If the latter is true, how do I go about letting other variables into the missing value analysis (for instance, relevant covariates listed in the NAMES ARE list)? 3. One of the Time 1 variables is gender. The way I have written the syntax now, gender is just listed after the ON statement. Should I specify that gender is categorical? If so, how do I write that in the syntax? Thank you in advance, and thanks for a wonderful help page.

I have a question about missing value treatment. If I want to use FIML instead of the EM algorithm, how do I do that?

Analysis=missing?

According to your previous response related to missing value treatment,

"Missing H1" says that we want to do ML estimation of an unconstrained (saturated) covariance matrix for the observed variables taking missingness into account under the MAR assumption (see the Little & Rubin book). This ML-MAR estimation is carried out using the EM algorithm in line with the L-B book.

Do you think that the MCMC option in LISREL does the same thing as multiple imputation under NORM or SAS proc MI?

FIML is an estimator and EM is one algorithm for computing FIML estimates. Other algorithms include Quasi-Newton, Fisher Scoring, and Newton-Raphson. Mplus uses the EM algorithm for the unrestricted H1 models and the other algorithms for H0 models.
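
To make the estimator/algorithm distinction concrete, here is a minimal Python sketch of EM for an unrestricted (H1-style) mean vector and covariance matrix. It is not Mplus code and is not claimed to be how Mplus implements the step. It assumes just two normal variables, with y1 always observed and y2 sometimes missing (coded as None), and the function name em_saturated is made up for illustration.

```python
def em_saturated(y1, y2, n_iter=200):
    """EM estimates of (mu1, mu2, s11, s12, s22) for a bivariate normal.

    Illustrative assumption: y1 is complete, y2 may contain None (missing).
    """
    n = len(y1)
    obs = [(a, b) for a, b in zip(y1, y2) if b is not None]
    # Starting values: y1 from all cases, y2 from the complete cases.
    mu1 = sum(y1) / n
    s11 = sum((a - mu1) ** 2 for a in y1) / n
    mu2 = sum(b for _, b in obs) / len(obs)
    s22 = sum((b - mu2) ** 2 for _, b in obs) / len(obs)
    s12 = 0.0
    for _ in range(n_iter):
        # E-step: expected sufficient statistics given current estimates.
        t2 = t12 = t22 = 0.0
        slope = s12 / s11
        resid_var = s22 - s12 * slope
        for a, b in zip(y1, y2):
            if b is None:
                e2 = mu2 + slope * (a - mu1)   # E[y2 | y1]
                e22 = e2 ** 2 + resid_var      # E[y2^2 | y1]
            else:
                e2, e22 = b, b ** 2
            t2 += e2
            t12 += a * e2
            t22 += e22
        # M-step: ML estimates from the expected sufficient statistics.
        mu2 = t2 / n
        s12 = t12 / n - mu1 * mu2
        s22 = t22 / n - mu2 ** 2
    return mu1, mu2, s11, s12, s22
```

Note that no individual value is ever filled in and kept; the E-step only updates expected sums, and the result is the ML estimate of the means and covariance matrix under MAR.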

Saying Analysis type = missing implies using all available data. With FIML this is the standard "MAR" approach to missingness.
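
As a rough illustration of what "using all available data" means for the likelihood, the Python sketch below (again not Mplus code) adds up casewise log-likelihood contributions for two normal variables. Each case contributes through whatever it has observed. The function name and the None coding for missing values are assumptions made for the example.

```python
import math

def casewise_loglik(rows, mu, cov):
    """Observed-data log likelihood for bivariate normal rows; None = missing."""
    (s11, s12), (_, s22) = cov
    total = 0.0
    for a, b in rows:
        if a is not None and b is not None:
            # Both observed: bivariate normal density.
            d1, d2 = a - mu[0], b - mu[1]
            det = s11 * s22 - s12 ** 2
            quad = (s22 * d1 ** 2 - 2 * s12 * d1 * d2 + s11 * d2 ** 2) / det
            total += -math.log(2 * math.pi) - 0.5 * math.log(det) - 0.5 * quad
        elif a is not None or b is not None:
            # Only one variable observed: use its marginal normal density.
            x, m, v = (a, mu[0], s11) if a is not None else (b, mu[1], s22)
            total += -0.5 * math.log(2 * math.pi * v) - 0.5 * (x - m) ** 2 / v
        # Rows with nothing observed contribute no information.
    return total
```

Maximizing this sum over the model parameters is the FIML idea; which algorithm does the maximizing is a separate choice.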

Dear Bengt/Linda, I am using type=missing H1 in combination with the WLSMV estimator for ordered categorical dependent variables. Can you tell me how MPLUS 3 deals with missing values in this situation? I have read appendix 6 of your technical appendices, but this appendix is restricted to normally distributed y-variables. Maybe you can give me a lit. reference? Thank you very much. Ad Vermulst

I am trying to do an exploratory factor analysis with both categorical and continuous variables. I have missing values in both, and I'm getting an error telling me that I can only use the missing option if all my dependent variables are continuous. Is there a way of getting around this? How should I treat the missing values on my categorical variables?

There is no test for MAR. If one suspects ways in which MAR is violated, non-ignorable missing data modeling can be attempted to see if results differ. Although it is not always easy, you can do non-ignorable modeling in Mplus - see for example the model diagrams posted at

Full information maximum likelihood and multiple imputation are clearly superior to other ad hoc approaches. I am debating which one to use for modeling my path analyses. Does anyone know if MI has a clear advantage over FIML?

I wanted to get a sense for whether or not there is a mathematical and/or conceptual relationship between three approaches to the modeling of non-ignorable missingness - the first two are: a) MI where the missing data pattern indicators are included (along with the variables of interest) in the imputation model but only the variables of interest are included in the analysis model (Schafer, 2003, Stat. Neerlandica) and b) FIML with auxiliary variables where the missingness indicators are additional outcomes predicted by the IV(s) of interest (along with the DVs of substantive interest) with residual correlations between the missingness indicator(s) and the substantive DVs (Graham, 2003, SEM).

I came across Schafer's (2003) suggestion on a simple approach to pattern mixture modeling where he says in contrast to traditional PMMS "......this process of averaging the results across patterns may be carried out by MI. Suppose that we generate imputations Y1mis.....YMmis under a pattern mixture model. Once these imputations exist, we may forget about "R" (the missing data pattern indicators) and use the imputed datasets to estimate the parameters of P(Ycom) directly."(bottom of p.27) (link to the paper on Schafer's site is @

Using R in the imputation model and throwing it out in the analysis model sounded very much like using R as a special-case auxiliary variable a la Graham (2003). In Graham (2003), Collins et al., (2001, Psych. Methods) and elsewhere, the equivalence between MI with auxiliary variables and FIML with auxiliary variables is either discussed or illustrated. But one of the key models that is suggested by Graham (2003) (the correlated residuals model described above) looked very similar to a third model (Muthen/Jo/Brown 03 JASA - specifically the model on page 6 of your lecture17.pdf) except for two things: a) mixtures of longitudinal trajectories (which is not an important difference per se) and b) latent missing data classes (e.g., CU in the diagram) that are correlated with (or at least account for differences in conditional means on) the growth parameters. Now to my real question - assuming the same model structure of interest across the two approaches (e.g., single-population LGM), is it safe or reasonable to say that Graham's (2003) model is a special case of your "CU" JASA model where missing data pattern class is "known" (or at least captured with observed measure(s) of missing data class)?

I like the Schafer and Graham (2002) Psych Methods paper and their discussion of MI and FIML. Consider cases where you have variables (Z, say) that relate to missing data and that don't belong in your model of interest for x and y. With MI you would use z in the imputation model but not in the analysis model. With FIML you would use z as extra y variables that are freely related to y and x.

The modeling with the missing data indicators (u say as in Lecture 17) is different. If you have MAR, modeling the u's in an unrestricted way in addition to x and y gives the same ML results as analyzing x and y only (ignorability of missingness). Modeling the u's aims to handle non-ignorability. Lecture 17 suggests several possible alternatives for doing such u modeling. Page 6 that you point to tries to simplify the u structure. This relates to pattern-mixture modeling where you have to use all missingness patterns as covariates. The pattern-mixture model essentially corresponds to a latent class model (the model with cu) that has as many classes as there are missing data patterns. With a latent class model for u, you essentially reduce the number of patterns to the number of classes. You can combine the u modeling idea of Lecture 17 with the z modeling idea above.

Thanks so much for your response Bengt; it was very helpful. I had an additional question on u modeling in general and cu modeling in particular. Other approaches to NMAR have a mechanism for (weighted) averaging of parms and se's across the missingness patterns such as hand-calculation, equality constraints (e.g., Allison 87, MKH 87), combining via matrix manipulation (HG 97) or the multiple imputation approach to NMAR that Schafer discusses in the .pdf linked above. For CU modeling of NMAR, it seems like constraining the estimates to get a weighted average of the covariate>growth parameter effects (i.e., X>I, X>S) is no problem (of course, modeling X>I and X>S only in the %overall% part of the model is less code to do the same thing). But it also seems like if one is interested in getting a weighted averaged estimate of the growth parameter intercepts (GPIs) (across all the latent missing data groups), ( E[ I | X, CU] and E[ S | X, CU] ), you may not be able to estimate them directly in the analysis because if you constrain the GP intercepts to equality, the problem reduces back to an MAR solution - it seems like constraining the GPIs in each CU class to equality eliminates the relation between the growth parameters and CU which seems like the very part of the model that handles non-ignorability. But if you allow the GPIs to vary across CU, you do not get a single (weighted averaged) estimate. Is my understanding of this off-base? If so, any additional guidance you could provide would definitely be appreciated. If this is not off-base, then would you recommend hand-calculation of the weighted average if one was interested in inferences on the GPIs?

I think your understanding is correct. You don't want to hold these parameters equal across classes, and this does lead to the problem of how one presents the results mixing over classes. I don't think this is resolved; it needs research. On the other hand, with a cu approach you have fewer patterns (number of classes) and therefore perhaps you are interested in presenting the results for each class by itself without weighting (mixing) them together. The classes may be so fundamentally different that you would rather treat them separately.

I was wondering what missing data strategy you would recommend for a small longitudinal SEM model? More specifically, I ran a SEM model in which there were 58 subjects at the first time point and only 50 subjects at the second time point (i.e. 8 subjects had missing data). I ran the model two ways (1) with listwise cases deleted and (2) with the means in the place of the missing data. Both models fit the data almost equally well and the same paths were significant in both models. Is the listwise strategy more rigorous than running the model with the means? Does this depend on the percentage of the sample that is missing data? Should I run the model another way?

Mplus uses the EM algorithm for ML estimation under the "MAR" assumption; see the Little & Rubin missing data book. In this approach, missing data are not imputed, but parameters of the model are estimated directly using all available data.

Hello, In Stata, I created a data set that has several multiply imputed data sets. When I try to read this data set into MPLUS, however, I get the same two error messages repeated until the program finally aborts:

-Errors for replication with data file [and then it lists a bunch of numbers].

-*** ERROR in Data command The file specified for the FILE option cannot be found. Check that this file exists: [and then again, a bunch of numbers].

As far as I can tell, the Stata file contains 5 multiply imputed data sets, but can you tell from the above message if this is a problem with the data in Stata or in Mplus?

The message means that the file you have named using the FILE option cannot be found. Perhaps you have misspelled it or it has an extra extension that you are not aware of. If the file contains 5 data sets, you need to separate them if you plan to use the IMPUTATION option of Mplus. If you have further questions on this topic, please send them along with your license number to support@statmodel.com.
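
If the imputations are stacked in a single file, one way to separate them is sketched below in Python (not Mplus). It assumes a comma-separated export without a header row in which the first column is the imputation number, and it assumes that index 0 marks the original unimputed data. The function name, the file-name pattern, and the column position are illustrative assumptions, so adjust them to your own export. The sketch also writes a plain text file listing the data set names, which is what the FILE option should point to when TYPE = IMPUTATION is used.

```python
import csv
import os

def split_imputations(stacked_path, out_dir, index_col=0, skip=("0",)):
    """Write one data file per imputation plus a list file naming them.

    Assumptions (illustrative): comma-separated input, no header row,
    numeric imputation index in column `index_col`, and index "0" marking
    the original unimputed data, which is skipped.
    """
    rows_by_imp = {}
    with open(stacked_path, newline="") as f:
        for row in csv.reader(f):
            if not row:
                continue
            imp = row[index_col].strip()
            if imp in skip:
                continue
            # Drop the index column so it is not read as a variable.
            rest = row[:index_col] + row[index_col + 1:]
            rows_by_imp.setdefault(imp, []).append(rest)
    names = []
    for imp in sorted(rows_by_imp, key=int):
        name = f"imp{imp}.dat"
        with open(os.path.join(out_dir, name), "w", newline="") as f:
            csv.writer(f, lineterminator="\n").writerows(rows_by_imp[imp])
        names.append(name)
    # The list file holds one data set name per line.
    with open(os.path.join(out_dir, "implist.dat"), "w") as f:
        f.write("\n".join(names) + "\n")
    return names
```

Keeping the list file in the same folder as the split data sets avoids the "file cannot be found" message caused by path mismatches.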

In the intro to the Missing Data Modeling Discussion board, there's a reference to a paper I can't find: "Non-ignorable missing data modeling is possible using maximum likelihood where categorical outcomes represent indicators of missingness and where missingness may be influenced by continuous and categorical latent variables (Muthén et al., 2003)." Can you provide a link or more information?

I am attempting to run a MIMIC LCA model with missingness on the covariates.

A colleague of mine recommended that, rather than including the x's in the model by mentioning their variances (which would require numerical integration to estimate the model), we instead create a new variable with mean zero and small variance, give random values to each case, and then regress all the covariates on this random variable. The covariates are then not independent variables in the model and can have missing values.

The syntax for this model is as follows (rg is the new, random variable)

Many thanks for your speedy response. I will be certain to pass your thoughts on to my colleague.

Multiple imputation has been our method of choice so far, however the problem is we now want to save the probabilities in a data file, which is something you cannot do when working with multiple datasets.

Can I ask, are you suggesting that in our case FIML wouldn't really be an option, i.e., that we should use MI and accept that we will not be able to save the probabilities?

If you have more than two or three covariates with missing data, it is impractical to bring the covariates into the model because the computational burden of numerical integration would be heavy. If only two or three of your covariates have missing data, then FIML should be fine. You should study the missing data in your covariates. Perhaps there are some with very little missing data such that you could allow the listwise deletion on those and bring the others into the model.

Missing data estimation using FIML is available for categorical outcomes by using the maximum likelihood estimator. Missing data estimation is also available using the weighted least squares estimator.

I receive the means for each of the variables p1 - p10. However, only the variables that do not have any missing data match up with the means calculated in Excel. I am certain that the missing values are set to 999 in the Mplus file. Thank you for your help.

I am analyzing some longitudinal data in a cross-lag model and have about 15% of subjects missing data on one variable at the first time point. These same subjects are missing subsequent time points for this specific variable. Essentially, for this 15% of the sample, there is no data on this one variable. However, these subjects have observations on other variables.

My question is whether I should delete the subjects who are missing this variable, or conduct the analysis on the entire sample by employing a missing data estimation procedure such as FIML. I guess I am not sure if the data is "missing at random". Thank you very much for your ideas/suggestions.

I would use missing data estimation even if the data are not missing at random if it meant losing 15 percent of the sample. You might want to do the analysis both ways and see if it affects the interpretation of the results.

I am conducting some analyses using data from NESARC. In a recent article (Grant et al., 2006) analyses were conducted using a sample of past year drinkers (n = 26946). I hope you could answer a query that I have.

I am aware that you have conducted analyses on this dataset and am interested to know how you and your colleagues dealt with missing data among this sample. I have read in the literature that listwise deletion of missing data is quite popular. I am aware however that the NESARC dataset contains a weighting variable. I have read on the Mplus discussion board that deleting missing data can have an adverse effect on the weighting variable. I want to use the weighting variable in my analyses and I am therefore reluctant to delete cases with missing data.

In an attempt to overcome this problem, I have included the following commands in the input:

Variable: Missing are all (-9);

Type: Complex mixture missing;

However, I am aware that other people have used the algorithm command in their analyses. Is this an appropriate solution to the issue of missing data? If not, what command(s) would you suggest I use/change in my analyses?

I would like to expand on a query from my previous post (July 27 2006). I have missing values for approximately 4% of my data. I am considering recoding my dataset from values of 'yes' to 'criteria present' and values of 'no' or 'missing' to 'all other responses'.

Do you think that I could statistically defend this treatment of missing data? I am aware that treating unknowns or missing values as negative has a certain element of risk (as there may be false negatives), but given the low proportion of missing data, I am unsure as to whether this is a problem. I was wondering however if you could perhaps suggest any references or authors that may have utilised such a technique?

I would treat the missing as missing and use TYPE=MISSING; I think it is dangerous to start recoding. You may want to search the literature to see if you can find anyone who advocates the approach that you suggest.

I am using Mplus for my dissertation analyses and I want to make sure that I understand how my missing data will be handled. I am using WLSMV (with covariates) and my understanding is that the data will be treated as missing as a function of the covariates. Could someone explain what this means? Thanks so much!

The ML-MAR approach to missing data allows missingness to be predicted by variables that are not missing for the individual, which can be both y and x variables. However, with WLSMV, if missingness is predicted by y variables, the results are distorted, while they are not if missingness is predicted by x variables.

Hello Bengt (I am having to send this in 2 parts...), I wanted to follow-up with you on our discussion from March 2006 on this thread about Latent Class Pattern Mixture Models (LCPMMs, i.e., "CU" models for NMAR dropout). With a very small sample (N = 128), I have looked at a series of K-class (i.e., single-class through 4-class) LCPMMs where CU (treatment attendance classes) jointly accounts for a) three-piece linear growth in alcohol use over time across 12 weekly alc. assessments, b) observed measures of treatment attendance (i.e., missingness) from weeks 2-12 of tx (i.e., everyone "shows up" for week 1) and c) the (calendar) week of the trial when the person started treatment. BIC and entropy suggest that a 3-class model fits best and, in fact, when you mix estimates (i.e., growth parameter intercepts, tx effects for each piece) across classes (weighted averaging outside the analysis), you make a different inference than you would have made if you took the results of the 1-class model (e.g., standard LGM under ignorability - but with the missingness indicators left in the model to compare BICs). (Part 1 ends here.....)

(Part 2 starts here....) My question is the 3-class model has 64 parameters - exactly half the number of people in the dataset, which is a dangerously low ratio of people-to-parameters (i.e., 2-to-1 - though the class-specific estimates do not look strange and I reproduce the log likelihood value multiple times with 500 starts). But 33 of those parameters (11 indicators x 3 classes) are the thresholds for the missingness (show/no-show) indicators. Lin et al (2004; Biometrics) say that for CU models for NMAR, data are MAR within each class, after conditioning on class membership - seems to me that once you condition on CU, you could ignore these missingness indicators (just as you would never need the missingness indicators in single-population models) and not be penalized for having such a low ratio because more than half the parameters in the model would not even be there if class membership were known. I wanted to get your thoughts on this and see if this was off-base........

I see what you are saying, but it seems that you cannot get at CU status without estimating those thresholds, so I think they are necessary. It is an empirical question if you do better with such a 64-parameter model than not trying NMAR at all.

Thanks again Bengt. I agree that this would be a very different model w/o the thresholds. I just worried a little bit about the low ratio, especially given that this particular CU model looks better empirically than 1-class model under MAR (though I realize this is not necessarily a "test" for or against MAR). No one has brought up the ratio problem with these data and it seems like it doesn't worry you either.....

I conducted a small sim as part of this work (as part of a poster at Yih-ing Hser's CALDAR conference and a talk I gave Oct 2 at Bud MacCallum's brown bag @ UNC), which focused on confidence interval coverage for the mixing of the class-specific parameters in the mean structure, using all the class-specific parameter estimates (e.g., growth parameter intercepts, treatment effects, show/no-show thresholds, variance components, etc.) from the 3-class model as population parameters with simulation N=128 (500 replications). I looked at coverage under 1-class through 4-class models, given there were 3 classes in the pop. There were two things that were encouraging for the three-class solution: 1) coverage was excellent for the 4 effects I looked at (weighted average treatment effects on the three linear pieces and the intercept at the last week of treatment), between 92-98% coverage across all replications and 2) no non-converged solutions/local maxima in any replication. Coverage was bad for 2-class and terrible for 1-class, with the majority of the confidence interval misses (relative to the pop. tx effect(s)) coming because the (class-mixed) tx effect was overestimated. 4-class is where the % of non-converged solutions was so high (even with 700 random starts in all conditions), I stopped studying anything beyond 3-class. Does this help?..

I am conducting analyses using data from NESARC, a complex survey design study. My analyses are concerned with a sub-sample of respondents, which I identify in my set-up using the subpopulation command. My query concerns the coding of respondents who are not members of the sub-sample. How should they be coded?

Just to clarify, those respondents who are not included in my subpopulation are coded as missing in the dataset (due to 2 screener questions). I have identified these people in the set-up (missing are all -9 and type = mixture complex missing). Is this correct?

Also, in my output, should the number of observations reflect the number of respondents in my subsample or rather the entire sample?

I wanted to follow-up on an Oct 2006 thread on Latent Class Pattern Mixtures (e.g., MJB, 2003) on an issue that probably comes up in any K >= 2 GMM. In working with the covariance matrix of the estimates (covb) (and a Jacobian matrix of 1st-order derivatives) to generate delta method standard errors for weighted-averaged estimates from LCPMM, I noticed that there were non-zero covariances *across* classes. I initially thought that was strange, as I was expecting covb to be block-diagonal (0s for all parameter covariances across classes). But then I wondered if these non-zero covariances were one of the places in the model where the uncertainty in class membership was reflected; in fact the two class combination (in my K=3 model) that has the largest off-diagonal in the matrix of average latent class probabilities also has the largest cross-class covariances. The other sets of cross-class covariances are 0 (or so small as to be functionally 0). Are my suspicions on-base? If not, any explanation as to why covb isn't block diagonal would be very helpful......
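
To make the role of those cross-class covariances concrete, here is a small Python sketch of a delta-method standard error for a weighted average of class-specific estimates. It is a simplification and not the poster's actual computation: the class weights are treated as known constants (an assumption), so the Jacobian reduces to the weight vector, and the function name is hypothetical.

```python
import math

def weighted_average_se(estimates, weights, covb):
    """Weighted average of class-specific estimates and its delta-method SE.

    Assumption (illustrative): weights are treated as known constants, so the
    Jacobian of the weighted average with respect to the estimates is just the
    weight vector and the variance is w' V w.
    """
    k = len(estimates)
    avg = sum(w * e for w, e in zip(weights, estimates))
    var = sum(weights[i] * covb[i][j] * weights[j]
              for i in range(k) for j in range(k))
    return avg, math.sqrt(var)
```

With a block-diagonal covb only the within-class variances enter the sum; any non-zero cross-class covariance changes the standard error of the mixed estimate. If the class proportions are themselves estimated, the Jacobian and covb would need extra rows and columns for them.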

Dear Linda and/or Bengt, I am analyzing data from a longitudinal study of risk for anxiety disorders and depression in 600+ high school juniors. At T1, we obtained self-reports on vulnerability measures for all subjects and tried to obtain peer-report versions of the same measures for all subjects. However, because some subjects refused to nominate peers and some peers refused to participate, we actually obtained peer-reports on roughly 50% of our subjects only. I was thinking about incorporating the missing data by using the multiple group approach to missing data. However, I have come across some references suggesting that the FIML approach to missing data is conceptually equivalent to the multiple group approach. If this is true, it certainly seems preferable to me to go the FIML route based on ease of model specification. Can you confirm that these approaches are equivalent? If not, when would you use the one and when would you use the other? Thanks for your time!

Look to see if the output says that individuals with missing on all variables or missing on x variables are deleted. If you don't see this, please send your input, output, data and license number to support@statmodel.com.

I'd like to use estimated sigma within and between matrices for multilevel regression and path analysis. In part, these matrices would serve as input for multiple group analysis.

Many variables in my dataset are treated as covariates. As far as I know, covariate missingness leads to listwise deletion when using FIML.

When using the sigma matrices as input for analysis with covariate missingness, I wonder what would be the right N for the sample/the groups. Has missingness in covariates to be taken in account to determine N for covariance matrix input?

For the pooled-within matrix use the sample size shown in the output where you saved the pooled-within matrix minus the number of clusters. This takes into account any observations lost due to missing data. For the sigma between matrix the sample size is the number of clusters.

Hi, I am trying to run a two-level path analysis but am having trouble estimating the missing data. When I take out the level two data and just run it as a path analysis, the model successfully estimates the missing data. But when I add in the level two data, it stops working. This is strange because all of the missing data is level 1 data. Any suggestions? Thanks for your time!

1. When using the "Type = imputation" option, how does Mplus generate the S.E. of the estimates? Does it apply Rubin's rules? I compared the results of the same model with FIML and MI, and the S.E. is quite different.

2. How can I request output of the relative efficiency, relative increase in variance, and fraction of missing information using Mplus?

1. We estimate standard errors for multiple imputation according to the Schafer 1997 reference listed in the user's guide. FIML and MI are asymptotically equivalent. Differences can come about with small samples.
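
For reference, the standard combining rules (Rubin's rules, as presented in texts such as Schafer, 1997) can be sketched in a few lines of Python. This is a generic illustration for a single parameter; whether it matches the Mplus computation in every detail is an assumption, and the function name is made up.

```python
import math

def pool_rubin(estimates, std_errors):
    """Pool one parameter over m imputed data sets with Rubin's rules."""
    m = len(estimates)
    q_bar = sum(estimates) / m                              # pooled estimate
    w = sum(se ** 2 for se in std_errors) / m               # within-imputation variance
    b = sum((q - q_bar) ** 2 for q in estimates) / (m - 1)  # between-imputation variance
    t = w + (1 + 1 / m) * b                                 # total variance
    riv = (1 + 1 / m) * b / w                               # relative increase in variance
    lam = (1 + 1 / m) * b / t                               # share of variance due to missingness
    return {"estimate": q_bar, "se": math.sqrt(t), "riv": riv, "lambda": lam}
```

The pooled standard error grows with the spread of the estimates across imputations, which is one reason MI and FIML standard errors can differ in small samples or with few imputations.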

This is a follow-up question to my previous inquiry about analyzing data using multiple imputation.

I generated multiple imputed data sets (40 replications) using PROC MI in SAS. I then analyzed the data using both Mplus and PROC CALIS, taking into account that the data were generated by multiple imputation methods. Below are examples of the resulting parameter estimates with standard errors in parentheses. For comparison, I have also included estimates obtained from Mplus using full information maximum likelihood.

It is interesting to note that the standard error obtained using PROC CALIS based on MI is quite comparable to that obtained using Mplus with FIML. The result from Mplus based on MI is strikingly different. How does one account for the large discrepancy?

I would need more information to comment on this. Given that the parameter estimates are simple averages over the replications, I wonder why they are different. Unless, they are the same, I wouldn't expect the standard errors to be the same. If you send the three outputs and your license number to support@statmodel.com, I can take a look at it.

I am trying to fit a structural model with imputed data sets generated through the ICE procedure in Stata. One of the endogenous variables is a dummy. After I fit the model, I do get the averaged CFI, TLI, and RMSEA and their respective standard deviations. However, I do not get those for the chi-square. How can I get them?

My version of Mplus is 4.21. When I change all my endogenous variables to continuous, I can get the averaged chi-square and its standard deviation. But if some endogenous variables are categorical, I cannot get the averaged chi-square and standard deviation. Is there some constraint for the WLSMV estimator?

With WLSMV, the chi-square test statistic and the degrees of freedom are adjusted to obtain a correct p-value. So the degrees of freedom vary across the replications, and therefore we do not report an average chi-square.

Monte Carlo integration can be useful with many dimensions of integration and in other special cases described in the user's guide. You can search for this in the computational statistics literature. I don't know of a particular reference offhand.

Hi Bengt and Linda, I am interested in using Mplus for fitting a growth model to a data set with missing values on the outcome variable. I could use TYPE= RANDOM MISSING and the model would produce factor scores and other estimates under a MAR assumption (missing data mechanism depends on observed data). However, my question is the following. If I want to model the missing data mechanism as in Diggle and Kenward, I could use the "missing data indicator at time t ON outcome at time (t-1)" (u ON y) type of code but would still need the TYPE=MISSING bit of code to avoid listwise deletion. Am I not "overriding" the missing data code with the inclusion of TYPE= MISSING? Should not factor scores obtained under the two model specifications (with and without the explicit missing data model) be different due to the presence of the model for the missing data mechanism as I am only including "outcome at time (t-1)" in the missing data model? Thanks, Graciela

The alternative of using missing data indicators (u_t on y*_t-1) in the modeling (so allowing for MNAR) takes the same approach of using all available data as MAR does, so Type = Missing does not override this. Factor scores should come out different in the two approaches given that the models are different.

In a conditional model, information on x does not contribute to the estimation of the regression coefficient in the regression of y on x, and the mean and variance of x are not estimated. So an observation with only information on x is not used because it has no information to contribute.

In an unconditional model where the means, variances, and covariance between y and x are estimated, cases with information only on x are included in the analysis.

The only thing you can do to avoid this exclusion is to mention the variances of the x variables that have missing on x in the MODEL command. This will cause them to be treated as y variables. Their means, variances, and covariances will be estimated and distributional assumptions will be made about them.

Thank you for your quick reply. Would you suggest estimating the (co)variances for the x-side variables (and declaring nonnormal variables on the categorical line)? Could this cause other problems or violate assumptions?

I would not do this. If your x variables are continuous normal, it would probably be okay and in line with multiple imputation programs. If they are categorical, it would change the model. The bottom line is that if you are interested in regression coefficients, bringing the cases with only x's into the model will not change the results.

Have you published and/or are you aware of any articles that illustrate methods for modeling the conditional expectation of the likelihood given the data and current values of the parameter set (i.e., EM for model parameters) for conventional (single-class, single-level) SEMs? I am either coming across applications of EM a) for means and covariances (e.g., Schafer, 1997, p.163-181), b) for model parameters within the multilevel track (e.g., Lee & Poon, 1998, Stat. Sinica; Liang & Bentler, 2004, Psychometrika) or c) for model parameters in the mixture track (Muthén/Shedden, 1999). For b and c, some places are obvious where the model, within the E-step, would be modified/structured to fit a conventional model as a special case - and not-so-obvious in other places. And many texts that discuss casewise ML for MAR data for conventional SEM seem to say little about EM, N-R or any other optimization techniques (while there is plenty of talk on this for multilevel and/or mixture SEM). Any help either of you could provide on this would be greatly appreciated………

Won't be able to get them until Mon (have hardcopy access but no electronic). The 2nd Rubin paper appears to be a response to the paper by Bentler & Jeff Tanaka where both groups traded concerns about susceptibility of another optimization method (N-R maybe? I'll see on Monday...) to local maxima. Thanks for pointing me in the right direction......

Hi Linda or Bengt, I know that a model with more parameters than subjects has identification problems (presumably related to the fact that the rank of the data matrix is limited by the number of subjects in this case) but I am not clear on how missing data impacts this. If I am fitting a model say with 200 parameters free to be estimated and 600 subjects total but have complete data from say 150 subjects (self-report is collected from all 600 but more expensive measures such as peer-report and diagnostic interview are collected from subsamples) would we have the same identification problems that we would have had if we didn't have the 450 subjects with partial data? or do the additional subjects with partial data help in this regard? Thanks for any insight you can provide!

Thanks for the very quick reply, Linda! I don't understand the last part about parameters for which we have little information to compute the H1 model. If you would be able to elaborate a bit I would appreciate it, as I am not sure which parameters those would be.

Dear Drs Muthen, this may be a very silly question, but I am struggling to figure out how Mplus handles missing data (MD). I have a dataset of 8028 people with 248 variables (both categorical and continuous), all of which include missing data. I know that Mplus 5 takes MD into account by default, but I want to know why the number of analysed cases differs according to the number of variables in the dataset. For example, when I regress age on sex (neither has MD), the number of analysed cases is only 5189 instead of 8028. On the other hand, when I create a new dataset including only age and sex as variables, the number of analysed cases is 8028. Why are there these differences? Is there any default mechanism that I am missing?

I am interested in LGC with multiple indicators and multilevel growth mixture models. As my dataset is "different" in a way, I would like to hear your advice on how to handle missing data in it using Mplus.

The dataset consists of several variables (cont. & cat.) across 5 timepoints on the item level (every variable shows up 5 times) and is hierarchical, with individuals nested in groups. The whole dataset is related to one country. The special thing about the dataset is that there is high fluctuation across groups (that I can control), but also across countries (that I can't control). Thus, individuals sometimes change groups. In addition, they sometimes leave the country (and probably return later), which appears as a missing value during the time of absence.

Overall, I have about 1000 individuals nested in about 25 groups, with all 5 timepoints available for 106, 4 timepoints for 87, 3 timepoints for 165, 2 timepoints for 220 and 1 timepoint for 442 individuals. I do not necessarily want to explain the absence.

If you use group as a level 2, you are treating group as a random mode of variation. In this case, changing group membership implies the need to use a "crossed random effects" approach (see the multilevel lit.), which Mplus currently cannot do.

The leaving the country at some time points is a missing data question which probably is handled fine by the standard ML MAR approach of Mplus. But you want to pay attention to coverage that is lower than the Mplus default limit of 0.10 (which can be altered) - that may already be too low (just put there to prevent convergence problems) for seriously relying on the results. It depends on where the low coverage occurs. If it is for a covariance between say the first and last time point, but coverage is otherwise high, then that is not so problematic because you typically don't have a growth model parameter corresponding to the covariance between first and last. A problem would be if low coverage happens for a variable (the diagonal of the coverage report) or for variables at close time points.
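As a rough illustration of what the coverage report summarizes, here is a minimal Python sketch (not Mplus syntax) that computes, for each variable and each pair of variables, the proportion of cases with data present, and flags entries below a limit. The function names, the toy data, and the use of `None` as the missing-data symbol are assumptions made for this example only.

```python
from itertools import combinations_with_replacement

def covariance_coverage(rows):
    """Return {(i, j): proportion of cases with both variable i and j observed}.

    The diagonal entries (i, i) give the coverage for a single variable.
    """
    n_cases = len(rows)
    n_vars = len(rows[0])
    coverage = {}
    for i, j in combinations_with_replacement(range(n_vars), 2):
        both = sum(1 for r in rows if r[i] is not None and r[j] is not None)
        coverage[(i, j)] = both / n_cases
    return coverage

def low_coverage(coverage, limit=0.10):
    """List the variable pairs whose coverage falls below the limit."""
    return sorted(pair for pair, value in coverage.items() if value < limit)

# Hypothetical outcome at 3 time points; None marks a missing value.
data = [
    [1.2, 1.5, None], [0.8, None, None], [1.0, 1.1, 1.4],
    [None, 0.9, None], [1.3, None, None], [0.7, 0.6, None],
    [1.1, None, None], [0.9, 1.0, None], [1.4, None, None],
    [1.0, None, None], [0.6, None, None], [1.2, 1.3, None],
]

coverage = covariance_coverage(data)
flagged = low_coverage(coverage)
for pair in sorted(coverage):
    print(pair, round(coverage[pair], 3))
print("below limit:", flagged)
```

In this toy data the third time point is flagged on the diagonal as well as in both covariances, which is the more serious situation described above.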

We have a data set of 300 adolescents who were sampled at three waves in a cohort sequential design. It's a typical longitudinal data set and we have a reasonable amount of missingness.

We did a series of unconditional and conditional LGM models to describe and predict constructs and concluded our paper with a conditional parallel process model between two constructs. To maximize our power and sample, we included cases that had data at two time points and used the missing estimator (our rationale being that two points provides a line if not a curve). When we looked at our data everything made sense and we wrote up and submitted our manuscript.

Upon receiving our reviews to this and two similar papers, we received consistent critiques that I'm hoping you can help me clarify for the reviewers. I'm writing today to ask if you can direct me to literature that address these issues.

1) Reviewers were concerned that latent growth curve models cannot be properly identified or stably fit with 3 (or fewer) time points. Is there evidence that the models are trustworthy?

2) Is our rationale to keep cases with at least two time points and use the missing estimator something we can justify?

3) For some of our estimates, the size of the estimate is very small (e.g., slope = .008). Although significant, how are small estimates to be interpreted?

1) Typically, at least 4 time points are desirable for good growth modeling. With only 3 time points, there are several model misspecification risks that cannot be countered because there are too few time points to identify more flexible models. This is discussed in our Mplus Short Courses, Topic 3 (see videos and handouts on our web site). Still, many published studies have used only 3 time points. If all individuals have only 2 time points, only a very limited growth model is possible.

2) It sounds like you have 3 time points for a majority of individuals and 2 for some. The percentages of each should be given. And, in fact, you could have included individuals with only 1 time point. This is what ML estimation under the "MAR" assumption of missing data theory (see the Little & Rubin book) would do. If a majority has 3 time points, I don't see a serious problem with this approach.

3) I think you are talking about a slope mean. The size of this depends on the time scores. The real question is what the implied change in mean is for the outcome from one time point to the next. You find that in the Mplus output when requesting RESIDUAL.
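To make the point about time scores concrete, here is a small Python sketch assuming a linear growth model, where the model-implied mean at a time point is the intercept mean plus the slope mean times the time score. The slope mean of .008 comes from the question; the intercept mean of 2.0 and both sets of time scores are hypothetical.

```python
def implied_means(intercept_mean, slope_mean, time_scores):
    """Model-implied outcome means for a linear growth model."""
    return [intercept_mean + slope_mean * t for t in time_scores]

def mean_changes(means):
    """Change in the implied mean between adjacent time points."""
    return [later - earlier for earlier, later in zip(means, means[1:])]

slope_mean = 0.008       # value quoted in the question
intercept_mean = 2.0     # hypothetical, for illustration only

# Same slope mean, two hypothetical codings of time.
unit_scores = [0, 1, 2]      # time scored in waves
month_scores = [0, 6, 12]    # time scored in months

changes_unit = mean_changes(implied_means(intercept_mean, slope_mean, unit_scores))
changes_month = mean_changes(implied_means(intercept_mean, slope_mean, month_scores))

print("change per wave, unit scores:", [round(c, 3) for c in changes_unit])
print("change per wave, month scores:", [round(c, 3) for c in changes_month])
```

The same slope mean implies a six times larger change per wave under the month coding, so the number alone says little until the time scores are known.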

I am working on fitting growth models to survey data and have a question about missing data.

There were 233 students in the sample. We collected data at 4 equally spaced time points. With regards to our key variables used to fit growth models, here is the breakdown of how often students provided data:

107 students provided data at all 4 time points (46%)
64 students provided data at 3 points (27%)
36 students provided data at 2 points (15%)
23 students provided data at 1 point (10%)
3 students did not provide any data (1%)

I have read somewhere that for growth models, we need at least 1 time point -- but I am not sure if having close to 10% people that provided only 1 observation will affect our growth models.

Assume that you have a linear growth model. The most important factor is how many individuals have at least 3 time points because that's how many you need to identify all the parameters. The individuals with fewer time points also contribute to the estimation of some parameters so they are helpful to include. Of those who have at least 3 time points you also want to know how representative they are of the whole group - a simple thing to check is if the mean of the outcome at the first time point is significantly different across the 4 missing data groups you list. If different, you may consider "pattern-mixture modeling".
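Using the counts given in the question, a short Python sketch shows the share of students with at least 3 time points, and the number who contribute any data at all. Only the dictionary layout and function name are assumptions here; the counts are those listed above.

```python
# Number of students by number of time points with data (from the question).
counts = {4: 107, 3: 64, 2: 36, 1: 23, 0: 3}

def n_with_at_least(counts, k):
    """Number of individuals observed at k or more time points."""
    return sum(n for points, n in counts.items() if points >= k)

total = sum(counts.values())
n_three_plus = n_with_at_least(counts, 3)
n_any_data = n_with_at_least(counts, 1)
share_three_plus = n_three_plus / total

print("total students:", total)
print("at least 3 time points:", n_three_plus, f"({share_three_plus:.1%})")
print("at least 1 time point:", n_any_data)
```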

In an earlier posting, it was mentioned that FIML was available for categorical outcomes. However, whenever I have tried this I get a warning stating, "Data set contains cases with missing on x-variables. These cases were not included in the analysis." This has been the case when I have run logistic regression analyses and when I have run SEM models with binary indicators of latent variables. Can you clear this up for me? Is there a way I can get Mplus to use FIML with such analyses?

A regression model is estimated conditioned on the observed exogenous variables. Means, variances, and covariances of these variables are not part of the regression model. Missing data theory applies to observed endogenous variables. You can include the observed exogenous variables in the model by mentioning their variances in the MODEL command. By doing this they are treated as endogenous variables and distributional assumptions are made about them.

I have a missing data question that I was hoping someone could answer. We developed a ten-factor measure of connectedness, with one factor measuring connectedness to siblings. As expected, some of our subjects do not have siblings and thus their data are missing appropriately. To avoid losing those subjects on the other factors, I estimated the missing data using a multiple imputation procedure and followed it with an invariance analysis that compared sibling and non-sibling samples across the factor loadings, intercepts, residuals, and covariance matrix.

My first question is do you find this analytic approach appropriate? As I expected, the results were nearly identical across the two samples, both when testing the ten factor model and single factor model (i.e., sibling connectedness scale). The only caveat is that the sibling connectedness results should only generalize to subjects with siblings.

My second question is whether mixture modeling with known classes is a better approach to answer this question. If my understanding of mixture modeling is correct, I would draw the same conclusion. Am I correct?

I would look at subjects with and without siblings separately. Then if you want to compare them, do so on the factors that are not about sibling connectedness. Imputing siblings for those without seems a little iffy.

Mixture modeling with only a known class variable is the same as multiple group analysis.

Hello, I am trying to determine how to approach a missing data issue. I have ratings of depression severity across time for about N=400. The timing of observations varies across individuals, so I plan to nest time points within individuals. One problem, however, is that the number of data points also varies across individuals. For instance, the number of data points for the sample ranges from 1-16, with a mean of 8 data points, SD=2.8, and variance=7.7. I am not certain what would be the best way to approach this. For example, would it be best to include only the first 8 time-points for the analysis? Any thoughts would be very much appreciated.

I would include all the data. The varying timings are handled by the AT option of the growth language (using |), and the varying number of time points can be handled either by

(1) using a single-level, wide approach letting the observation vector be of length 16 and using a missing data symbol for time points not available

or

(2) using a two-level (time points within individuals), long approach with a univariate outcome, where the different number of time points per individual is merely resulting in different cluster sizes and is therefore inconsequential.
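The difference between the two layouts can be sketched in Python (this is not Mplus syntax, and it is not the DATA WIDETOLONG feature itself). In the wide layout every person has a fixed-length vector padded with a missing symbol; in the long layout each available observation becomes one row, so people simply end up with different cluster sizes. The record layout, the use of `None` as the missing symbol, and the maximum of 4 time points (instead of 16) are assumptions to keep the example short.

```python
from collections import Counter

def wide_to_long(records):
    """Turn (id, times, outcomes) wide records into (id, time, y) long rows.

    Time points with a missing time or outcome are dropped, so they never
    appear in the long file.
    """
    long_rows = []
    for person_id, times, outcomes in records:
        for t, y in zip(times, outcomes):
            if t is not None and y is not None:
                long_rows.append((person_id, t, y))
    return long_rows

# Hypothetical wide data: individually varying times and numbers of observations.
records = [
    (1, [0.0, 1.5, 3.0, 4.2], [12, 10, 9, 7]),
    (2, [0.0, 2.1, None, None], [15, 14, None, None]),
    (3, [0.5, None, None, None], [8, None, None, None]),
]

long_rows = wide_to_long(records)
cluster_sizes = Counter(person_id for person_id, _, _ in long_rows)

for row in long_rows:
    print(row)
print("observations per person:", dict(cluster_sizes))
```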

Hi Dr. Muthen, I have attempted a LGMM using the first option, however the model does not converge. Do you think that convergence would be more feasible with the second option? Thanks very much for your help!

The error I obtain is the following: *** ERROR One or more variables have a variance of zero. Check your data and format statement.

There is one variable with only 3 subjects and the variance is 0.027. However, Mplus indicates that the variance is 0.000 for this variable. Is this possible or is the data file incorrect? Many thanks!

I know that Mplus does not generate graphs when the type=random analysis is used to account for individually-varying times of observation. I'm guessing this is also true if type=random is used in the LGMM framework, correct?

I'm using the ECLS-B database from NCES. There are about 12 weights to be applied to various variable sets. I think I've identified the correct weight for my variables and should receive confirmation from NCES soon.

However, even though the output in SPSS shows that I have a weighted variable, my Mplus output shows that only something like 51 cases were analyzed. There are missing weight values for some cases, predictably. And yet, I'm told I cannot impute any value, neither a 1 nor a 0 for example, in Mplus.

Do you have any transformation or filtering suggestions, so that I can do a CFA with a larger sample?

Hello Dr. Muthen, Regarding your response to my query concerning how to approach missing data (Bengt O. Muthen posted on Thursday, February 26, 2009 - 6:53 pm), what are the advantages and disadvantages to taking a long vs. wide approach? I have more familiarity with the wide approach and would prefer it, but will certainly consider the long approach if it has definite advantages over missing data. Also, a few more questions: 1) Can the long approach be used in conjunction with a LGMM? 2) Is it possible to graph the classes of trajectories with an HLM that uses the analysis=random option? Thanks very much!

I think the wide approach is generally preferable, but not always. For example, you can allow the residual variances for the outcomes to vary across time. But the wide approach has to use the max number of observations per subject which may lead to a long observation vector (very wide). And with individually-varying times of observation having a different residual variance for each time point makes for many parameters. Furthermore, some time points may not have variation in the outcome if the missingness is extreme.

1) Yes. In this case the latent class variable is a between-level variable (see UG for examples).

Hi Dr. Muthen, Yes, using the wide approach, I've found that some time points do not have variation in the outcome b/c the missingness is extreme. I was thinking of simply only including time points where the covariance coverage equals or exceeds .60 (although I have no reference to justify this approach). 1. Does this seem to be an acceptable solution? 2. Do you know of any references that recommend such a covariance coverage? thanks!

You can manipulate the data to fit better with the wide approach by deleting time points or combining them with adjacent timepoints, but such manipulation does not seem right. Given what you see, I would instead take the long approach. You may find the DATA WIDETOLONG option helpful.

Just wondering if it's possible in any way to run MLM estimation when having missing data. I have noticed that MLM requires the raw data (so it must be a FIML type estimation) so even if I feed the model with a covariance matrix it won't work.

Hello Dr. Muthen, I've attempted to transform the data from wide to long per your suggestion (Bengt O. Muthen posted on Thursday, March 05, 2009 - 8:13 am). Thankfully, the model ran! However, I have a few questions regarding interpretation: 1) How might I obtain a graph of the LGM trajectory? When I attempt to view the individually-fitted curves, only two data points are plotted on the y-axis. 2) If I am using the long option, I no longer need TSCORES correct? 3) How might I compare LGM models? When using the wide approach, I've conducted chi-square diff tests for nested models (intercept vs. intercept + slope vs. Intercept + slope + quadratic slope), but I am not certain how to do this using the log likelihood test. Many thanks!

In the newer versions of Mplus, TYPE = MISSING is the default, where missing cases are handled under the Missing at Random (MAR) assumption using Full-Information Maximum Likelihood (FIML). You may also specify models with listwise deletion through LISTWISE=ON in the DATA-command. More information is provided in the User's Guide, pp. 7-8.

I've received this comment from a reviewer, regarding a confirmatory factor analytic study:

"Were missing data patterns missing at random (this can be done in Mplus by specifying a mixture analysis and using only a single class latent variable, using the %OVERALL% syntax at the beginning of the model statement and declaring the outcome variable as categorical variables)."

I don't understand what they are suggesting, and even if I did understand, I don't see how any test could tell if the data were MAR/MCAR vs MNAR.

I have Mplus version 5. I am running a path analysis and I understand that the default is to estimate the model under missing data theory. How can I turn off this option? I just want to use complete case analysis in order to compare my results with another package. Thank you

This option came out with Version 5. Perhaps you are using an older version where listwise deletion is the default. If not you need to send your full output and license number to support@statmodel.com.

A reviewer of one of my manuscripts requested that I report how Mplus handles missing data. I have a complex structural equation model (see below). I used the WLSMV estimator and MISSING = ALL (999). The outcome variable is categorical (1=relapse, 0=abstinent) and no subjects are missing on this variable. However, some subjects have missing data on some of the other observed variables. For instance, some subjects do not have data for c1-c4 (each of the observed variables that make up the crave latent variable). Is my description of what Mplus does in this situation correct? Syntax for the model is below.

“Intent to treat abstinence was the dependent variable in the current study. Thus, none of the participants were missing on the dependent variable (i.e., missing were counted as relapse). However, some participants did not complete all of the study measures. Mplus handles these missing values by estimating them using the other variables in the model.”

MODEL:
  SES by s1 s2 s3 s4;
  Neigh by h1-h4;
  Support by i1-i3;
  NA by n1-n4;
  agency by a1-a5;
  Crave by c1-c4;
  neigh on ses;
  support on ses neigh;
  NA on neigh support crave;
  agency on crave na;
  w4itt on agency ses;

Factor indicators are dependent variables. For censored and categorical outcomes using weighted least squares estimation, missingness is allowed to be a function of the observed covariates but not the observed outcomes. When there are no covariates in the model, this is analogous to pairwise present analysis.

I think that what you are saying is that all of your dependent variables are continuous except abstinence. If this is the case, I would use maximum likelihood estimation where maximum likelihood estimation under MCAR (missing completely at random) and MAR (missing at random; Little & Rubin, 2002) is available for continuous, censored, binary, ordered categorical (ordinal), unordered categorical (nominal), counts, or combinations of these variable types. MAR means that missingness can be a function of observed covariates and observed outcomes.

I am interested in receiving your suggestions for analyzing my data using Mplus. The data comes from an intervention study for couples transitioning to marriage. 18 couples completed pre-test data and 12 couples completed post-test data. I am interested in assessing change in couple's attachment, affect regulation, empathy, and trust (continuous variables) following intervention. This was a very preliminary study which is why I want to use the FIML capabilities of Mplus to keep the sample size higher than 12.

The standard approach to analyzing longitudinal data is to use FIML under the "MAR" assumption (see missing data lit.). This means that you use all available data - 18 couples at time 1 and 12 couples at time 2. I assume that the 12 couples are a subset of the 18. Couple, not individual, represents the mode of variation for which independence of observations is assumed to hold, so the sample size is 18. Because of this, note that with only 2 timepoints the sample size of 18 is quite low and does not allow the estimation of a model with many variables and parameters.

Thanks so much Linda! Just to confirm--so, I will need to plug in the regression coefficients in the equation to calculate the predicted values using the DEFINE command, and then use the SAVEDATA command to save it, right?

"In the newer versions of Mplus, TYPE = MISSING is the default, where missing cases are handled under the Missing at Random (MAR) assumption using Full-Information Maximum Likelihood (FIML)."

And then was followed up with a statement by Bengt:

"With ML estimators all available data are used, using 'MAR'"

I have seen this sort of wording, "all available data are used", by Drs. Muthen in regard to missing data in several places but I have not seen either of them directly state that when using TYPE=MISSING FIML is being employed.

Is it fair to say that when you specify TYPE=MISSING (which is now the default) Mplus is using FIML?

"FIML" is used in some literature to mean full-information maximum-likelihood estimation (most often with continuous outcomes, but that is not necessary) and with missing data the "MAR" assumption of missing data theory is utilized. (As an aside, I think the "full-information" part is superfluous because maximum-likelihood estimation uses full information; to me it is not a good idea to add unnecessary acronyms beyond those in mainstream statistics.) Mplus uses ML to refer to maximum-likelihood estimation. ML under MAR is therefore the same as "FIML" and uses all available data.

So, TYPE=MISSING together with ESTIMATOR=ML gives "FIML". TYPE=MISSING together with ESTIMATOR=WLSMV, however, does not use MAR but a less flexible assumption detailed in the UG.
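What "uses all available data" means for ML with continuous outcomes can be sketched in Python for two normally distributed variables: each case contributes the log-likelihood of whatever it has observed, bivariate if both values are present and univariate if only one is. This is an illustration of the general casewise idea under a normality assumption, not a reproduction of Mplus internals; all function names, data and parameter values are hypothetical.

```python
import math

def normal_logpdf(y, mean, var):
    """Log density of a univariate normal."""
    return -0.5 * (math.log(2 * math.pi * var) + (y - mean) ** 2 / var)

def bivariate_normal_logpdf(y1, y2, m1, m2, v1, v2, cov):
    """Log density of a bivariate normal with variances v1, v2 and covariance cov."""
    det = v1 * v2 - cov ** 2
    d1, d2 = y1 - m1, y2 - m2
    quad = (v2 * d1 ** 2 - 2 * cov * d1 * d2 + v1 * d2 ** 2) / det
    return -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + quad)

def case_loglik(y1, y2, m1, m2, v1, v2, cov):
    """Log-likelihood contribution of one case, using only its observed values."""
    if y1 is not None and y2 is not None:
        return bivariate_normal_logpdf(y1, y2, m1, m2, v1, v2, cov)
    if y1 is not None:
        return normal_logpdf(y1, m1, v1)
    if y2 is not None:
        return normal_logpdf(y2, m2, v2)
    return 0.0  # nothing observed: the case adds no information

def total_loglik(cases, m1, m2, v1, v2, cov):
    """Sum of casewise contributions; this is what ML maximizes over the parameters."""
    return sum(case_loglik(y1, y2, m1, m2, v1, v2, cov) for y1, y2 in cases)

# Hypothetical data: None marks a missing value.
cases = [(1.0, 2.0), (0.5, None), (None, 1.5), (1.2, 1.8)]
loglik = total_loglik(cases, m1=1.0, m2=2.0, v1=1.0, v2=1.0, cov=0.3)
print(round(loglik, 4))
```

No value is filled in for the incomplete cases; they simply enter through a lower-dimensional density.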

I am running growth models with a lot of missing data. I need to compare nested models. However, my understanding is that it is not appropriate to use traditional chi-square difference tests to compare the fit of the nested models when modeling missing data due to the approximated chi-square values. Further, the manual states that the DIFFTEST option can only be used with MLMV or WLSMV estimators, yet I am using the default (ML) estimator. What is the most appropriate way to compare the relative fit of the nested models in this case? Should I be changing my estimator, or using some other approach?

The presence of missing data should not be an issue with difference testing. It is only the estimator that dictates the type of difference testing. ML uses a simple difference in chi-square. MLR requires the use of a scaling correction factor. Estimators ending in MV can use the DIFFTEST option.
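The two hand-computed cases can be sketched in Python. For ML the difference test is the plain difference in chi-square values and degrees of freedom. For MLR the sketch uses the scaled difference formula usually attributed to Satorra and Bentler, in which each reported chi-square is multiplied by its scaling correction factor before differencing. The function names and all numeric inputs are hypothetical; check the formula against the Mplus documentation before relying on it.

```python
def ml_chisq_difference(t_nested, df_nested, t_comparison, df_comparison):
    """Plain chi-square difference test statistic and df for ML."""
    return t_nested - t_comparison, df_nested - df_comparison

def scaled_chisq_difference(t0, df0, c0, t1, df1, c1):
    """Scaled difference test for MLR-type chi-squares.

    t0, df0, c0: chi-square, df and scaling correction factor of the nested model.
    t1, df1, c1: the same quantities for the less restrictive comparison model.
    """
    cd = (df0 * c0 - df1 * c1) / (df0 - df1)   # difference-test scaling correction
    trd = (t0 * c0 - t1 * c1) / cd             # scaled difference statistic
    return trd, df0 - df1

# Hypothetical output values for two nested growth models.
diff_ml, df_ml = ml_chisq_difference(85.0, 40, 60.0, 36)
diff_mlr, df_mlr = scaled_chisq_difference(85.0, 40, 1.20, 60.0, 36, 1.15)

print("ML difference:", diff_ml, "on", df_ml, "df")
print("MLR scaled difference:", round(diff_mlr, 3), "on", df_mlr, "df")
```

Either statistic is then referred to a chi-square distribution with the difference in degrees of freedom.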

Thank you for your response. I realize that ML chi-square is typically the difference in the chi-square values (with the difference in the degrees of freedom as the df). However, when I estimate these models using regular ML estimation, the df change between samples. For example, if I run a model with sample A, and then run the exact same code with sample B, my chi-square df changes from 70 to 71, respectively. This implies to me that regular chi-square difference testing might not be ok. Am I totally off base?

If the model is the same, changing the sample does not change the degrees of freedom with ML. If you send the two outputs and your license number to support@statmodel.com, I will find the explanation of this difference.

Both models estimate the same number of parameters. The difference in degrees of freedom is due to a different number of parameters in the unrestricted models due to different patterns of missing data in the two samples.

Do you deem it necessary to conduct an analysis of sample selectivity when FIML is used? I thought of comparing those with full data with those having at least one missing value on the main study variables of interest. However, I'm not sure whether this analysis is theoretically needed because FIML uses all data available to estimate the model. In case there are differences between the two groups, is MAR violated?

MAR is not necessarily violated - the missingness can still be predicted by the variables that are observed. You cannot test if MAR holds. Although of interest in itself, your comparison can only reject MCAR. So unless you try to get into NMAR (not missing at random) modeling, you might just as well go ahead with ML under MAR (i.e. what is often called FIML).

I'm trying to understand how missingness in x variables is handled in Mplus. I have tried the simplest case with two continuous variables based on a sample size N=415, with X missing for 22 cases while Y is missing for only 1 case. If I regress y on x I get a message that 1 case is missing on all variables (N = 414). I had expected a message indicating that the analysis would be based on 393 cases (415 - 22). Is the analysis based on 414 cases or on 393 cases (i.e., is this listwise deletion, or are the cases with missing Xs somehow adjusted for missingness rather than omitted from the analysis)? I tried to find information on this and don't understand one of your statements that "Covariate missingness can be modeled if the covariates are explicitly brought into the model and given a distributional assumption." Have I done this in my example? Thank you.

Your analysis uses TYPE=GENERAL with continuous outcomes. In this special case, there is no difference between estimating the model conditioned on the x variables or treating the x variables as y variables. This is why the 22 cases are not deleted from the analysis. In other cases, it does make a difference how the x variables are treated and cases with missing on x are deleted unless they are explicitly brought into the model by, for example, mentioning their variances in the MODEL command. In this case, they are treated as y variables and distributional assumptions are made about them.

I am doing a longitudinal study of 1000 children followed at four time points to assess language and literacy growth. Since this study is still ongoing, there are some children who have not been assessed at time 3 and time 4 yet. In one of my papers I'm focusing on time 2-time 4, doing SEM, to examine how various language skills are related to later literacy development. I'm not very familiar with missing data, but in my data I have some missing values due to the fact that some children have not been assessed yet. What type of missingness is this, and how do I handle it?

I want to compare a measurement model obtained from a complete sample (N=1041) with the same measurement model obtained by multiple imputation using the same data with approximately 30% planned missingness MCAR. I want to see if the MI approach gets close to the original measurement model in a real data set. The items are scaled on a 7-point Likert scale.

I have managed to run the measurement model using both methods and the models look similar but I wondered whether the data could be combined in one measurement invariance type analysis (multigroup?).

Is this possible?

For the MI analysis I used a .dat file with the names of the 30 imputed datafiles.

The complete-data sample and the MI samples are not independent so multigroup analysis would not be correct.

What you could do is to divide the sample into groups that have different planned missingness (variables for which everyone has data plus variables that some have data for) and then do a multigroup analysis where you can test invariance over model parts that are in common for the different groups. So this would not use MI.

I was wondering how Mplus handles missing data with WLSM in categorical factor analysis?

I thought Mplus handled missing data using maximum likelihood, but when I run the following analysis code: TYPE = COMPLEX EFA 1 5 MISSING; the output says the program used the WLSM estimator so how could the program also be using the ML estimator?

We have collected student self-report data at seven time points and are interested in doing MM, which may lead into GMM or LGM, depending on the results of the MM. However, we have missing data (total n = 1434; listwise n = ~1271). We have determined that the missing data are not MCAR, and for now are treating them as MAR (will eventually do MNAR models, but are starting with MAR). We would like to do FIML.

2. We have quite a number of external covariates, which we are hoping to use with the auxiliary command. However, some of the external covariate data are missing, as well. Can we use these data with FIML and the auxiliary command? Or, what is your recommendation?

By MI I assume you mean Multiple Imputation. I don't know about MI software with a mixture (I assume your MM notation means mixture modeling), but perhaps you mean doing MI for subjects grouped by most likely class, which might be an alright approximation. But perhaps you could simply do MI for the external covariates without involving mixtures.

If your substantive model can reasonably be extended to include those multiply imputed external covariates among your other covariates, that might be the most straightforward approach. Otherwise, you can include the externals as auxiliaries, either with them imputed or with their missingness.

No model results are shown (at least only the estimate is shown without s.e., p-values, MI etcetera) and I receive the following text: MAXIMUM LOG-LIKELIHOOD VALUE FOR THE UNRESTRICTED (H1) MODEL IS -63112.478 NO CONVERGENCE. NUMBER OF ITERATIONS EXCEEDED.

I have already tried to increase the number of iterations but this didn't help. Can the high number of missing values explain this error (the number of missing data patterns is 133)? If yes, how can I solve this? If not, do you have another suggestion that explains this error? Thank you very much for your help. Best wishes!

I am wondering the best way to handle a standard CFA with dichotomous indicators where 1 indicator has missing data for all members of a dichotomous covariate? I get the following error when I run the model:

THE WEIGHT MATRIX PART OF VARIABLE AMEN IS NON-INVERTIBLE. THIS MAY BE DUE TO ONE OR MORE CATEGORIES HAVING TOO FEW OBSERVATIONS. CHECK YOUR DATA AND/OR COLLAPSE THE CATEGORIES FOR THIS VARIABLE. PROBLEM INVOLVING THE REGRESSION OF AMEN ON GENDER. THE PROBLEM MAY BE CAUSED BY AN EMPTY CELL IN THE BIVARIATE TABLE.

Dear Dr. Muthen, A quick question: I'm using the MLR estimator for CFA analyses. I have opted to use multiple imputation in order to test CFA models separately in multiple waves of data (sample size precludes normal temporal invariance investigation using FIML).

Given the robust estimation, I am concerned that MPLUS is not providing a scaled correction factor in the imputed results. Is this a valid concern? Do I need to compute the scaling factor? and if so, how? Thanks in advance, Brian

I'm not sure that using multiple imputation rather than FIML helps with a small sample size. You can test for invariance over time without looking at each time point separately. See the Topic 4 course handout starting with Slide 78 where multiple indicator growth is shown. The first steps test for measurement invariance.

If you are using TYPE=IMPUTATION and MLR, you will obtain an average MLR chi-square and standard deviation over imputations. These chi-square values have been corrected using the scaling correction factor. How to use a scaling correction factor with multiple imputation is a research question.
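For parameter estimates (as opposed to chi-square values), results from several imputed data sets are conventionally combined with Rubin's rules: the pooled estimate is the average over imputations, and the pooled variance adds the average squared standard error to the between-imputation variance of the estimates. The Python sketch below shows these general rules under the assumption that each imputed data set yields one estimate and one standard error; the function name and numbers are hypothetical, and it is not meant to reproduce any program's output exactly.

```python
import math
import statistics

def pool_rubin(estimates, standard_errors):
    """Combine one parameter over m imputed data sets with Rubin's rules.

    Returns (pooled estimate, pooled standard error, total variance).
    """
    m = len(estimates)
    pooled = statistics.fmean(estimates)
    within = statistics.fmean([se ** 2 for se in standard_errors])
    between = statistics.variance(estimates)   # sample variance, divisor m - 1
    total = within + (1 + 1 / m) * between
    return pooled, math.sqrt(total), total

# Hypothetical factor loading estimated in 5 imputed data sets.
estimates = [0.50, 0.54, 0.46, 0.52, 0.48]
standard_errors = [0.10, 0.10, 0.10, 0.10, 0.10]

pooled, pooled_se, total_var = pool_rubin(estimates, standard_errors)
print("pooled estimate:", round(pooled, 3))
print("pooled SE:", round(pooled_se, 4))
```

The pooled standard error is larger than any single-imputation standard error because it also reflects how much the estimate moves from one imputed data set to the next.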

I am running a latent profile analysis with imputed data. I have generated 40 imputations with SAS proc mi, and created an ASCII file containing the names of the 40 data sets as described in the Mplus User’s guide. Although I am able to get the LPA models to converge I am concerned about the range of the indicator estimates across the classes - I have three continuous variables as indicators, all of which have been z-scored. Is it possible that the profiles may well change meaning from imputation to imputation in Mplus? In other words, across the 40 datasets is it necessary to verify that profile 1 always has the same meaning across imputations, as do profile 2, profile 3, etc. How does Mplus handle this?

I have a question about handling missing data in SEM analysis for panel studies. Marini, Olsen, & Rubin (1980) suggest that this method should be used with nested missing data patterns, that is, the sample at every subsequent wave should be a subsample of the previous wave, like this:

     t1  t2  t3
n    1   1   1
n    1   1   0
n    1   0   0
etc...

The few reviews I found do not make clear statements on this issue (Enders, 2001; Newman, 2003). For example, what if the missing data pattern is like this:

My concern is what is recommended for the non-nested cases in the available data: should they be dropped, or kept for the analysis of panel data when using ML?

Another concern is the 'few cases' in the panel paths: is this only a problem of sample size (enough data to estimate the parameters), or is there a relation between the N for the within covariances (panel cases) and the N for the between covariances (cross cases)?

I'll welcome any comments or directions on this issue, thanks in advance!

I think you make a distinction between dropout (monotone missingness) and intermittent missingness. I would think it is ok to make the standard MAR assumption for intermittent missing; perhaps it is even MCAR. You should certainly keep these cases in your analyses. The principle should be to use all available data. MAR for dropout may hold close enough, but for dropout one may want to also investigate other modeling (see for instance my paper under Missing data). But this is more advanced since it means that the missingness is part of what you model.

You then bring up coverage which Mplus prints for each outcome and pairs of outcomes. You want both types to be high in a longitudinal model.

Dr. Muthen, I'm running a simple model examining one indirect effect with one mediator. I have missing values on all variables (x, m, and y) and mentioned the x variable in the model command with the aim of FIML handling all missing data. However, Mplus is dropping cases that have missing values on all variables. Can FIML not address such cases? When I run the same model in Amos (which from what I understand also uses FIML) it appears to use the entire sample. Can you please explain what is happening here? Thank you.

Thank you. There are other variables not in the model that these cases have values on. Would Mplus stop dropping the cases if I brought these in as auxiliary variables? If so, do I only mention these variables in the auxiliary command or do they also need to be mentioned in the usevariables command?

I have a question about the individual LL values output under the SAVEDATA option. In trying to reproduce individual loglikelihood values from a single-rep simulated dataset under MAR missingness, the values I calculated for a single case (in Proc IML) were slightly different from the value(s) produced in the SAVEDATA output. I was originally using the model-implied means and covariances under H0 to calculate the loglikelihood in Proc IML but then switched to the H1 means and covariances; the H1 sufficient stats seemed to reproduce the proper individual LL values in the output dataset. So a) am I right in my understanding that the LL values under the SAVEDATA command are the H1 LL values and, if so, b) is there any way to also output the individual values under H0?

This is more for illustrative purposes, to show case-level discrepancies between H1 and H0 LLs. I was thinking about them for a module on FIML for a seminar on missing data. Your point is well taken, though, because it is difficult to think of how H1 LLs would be useful in practice when your concern is H0 in real applications.

Just to be sure of myself: when type=missing is specified, what is the default method that Mplus uses to handle missing data? Or is this a function of the model specified? In my case, it is a simple path analysis; all variables are observed.

Mplus provides maximum likelihood estimation under MCAR (missing completely at random), MAR (missing at random), and NMAR (not missing at random) for continuous, censored, binary, ordered categorical (ordinal), unordered categorical (nominal), counts, or combinations of these variable types (Little & Rubin, 2002). MAR means that missingness can be a function of observed covariates and observed outcomes. For censored and categorical outcomes using weighted least squares estimation, missingness is allowed to be a function of the observed covariates but not the observed outcomes. When there are no covariates in the model, this is analogous to pairwise present analysis.

I am working on the prospectus for my dissertation in which I will be using a national (NCES) longitudinal dataset. A number of exogenous variables are categorical, as is the mediating variable (5 levels) and the outcome (7 levels- so could be considered continuous). I have some missingness (data meet MAR)- seldom more than 10% and a good deal less in most cases. I have two questions, and I apologize if these seem horribly basic:

1) Is it best to use the WLSMV estimator?

and, if so,

2) Do I need to employ MI to deal with missingness as a first step? I have had some faculty at a training I attended suggest that it is always better to impute first, even in the SEM framework.

I'm still learning, so I'm having trouble understanding when and why I would or would not be wise to use MI.

With categorical outcomes, you can use either weighted least squares or maximum likelihood estimation. If your model has more than four factors, maximum likelihood would not be feasible because numerical integration is required. If you use maximum likelihood, you can use the default missing data estimation which is asymptotically equivalent to multiple imputation. If you use weighted least squares estimation, I would use multiple imputation because missing data estimation with weighted least squares estimation is not as good as with maximum likelihood.
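If the imputation is to be done in Mplus itself, a first run along the following lines could generate the imputed data sets. This is only a sketch: it assumes a version of Mplus that offers the DATA IMPUTATION command, and the file names, variable names, missing-value code, and number of data sets are all hypothetical choices.

```mplus
TITLE:    Sketch: generate imputed data sets in a first run
DATA:     FILE = mydata.dat;       ! hypothetical file name
VARIABLE: NAMES = y m x1 x2;       ! hypothetical variable names
          MISSING = ALL (-999);    ! assumed missing-value code
DATA IMPUTATION:
          IMPUTE = y (c) m (c) x1 x2;  ! (c): impute as categorical
          NDATASETS = 20;          ! arbitrary number of data sets
          SAVE = imp*.dat;         ! * is replaced by the data set number
ANALYSIS: TYPE = BASIC;
```

The saved data sets can then be analyzed with weighted least squares in a second run that uses TYPE = IMPUTATION in the DATA command.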

I am a new user of Mplus. I am trying to run a latent class analysis. The data file covers two waves of data and has complex sample survey features: stratification, clustering, and weights.

I also need to use the subpopulation option. The sample is the wave 4 sample.

My questions are:

(1) After I limit my sample to the wave 4 sample, my data still have missing values on all variables. Income is a continuous variable and all others are categorical variables. Am I able to use Full Information Maximum Likelihood to deal with missingness? I read somewhere online that Mplus FIML is only for continuous variables. Is that true?

Thanks for the reply, Linda. Although my data are longitudinal (two waves), there are no repeated measures. Wave 1 data are respondents' reporting of their parents socioeconomic variables and wave 4 data are respondents' socioeconomic variables. I want to use latent class analysis to capture intergenerational mobility. And I want to identify individuals into different class membership. For example, I want to classify people into different groups, like moving up, staying the same as their parents, or moving down. For this kind of model, can I do simple latent class analysis (treating the longitudinal data as cross-sectional data) instead of LCGA or GMM?

BTW, using Full Information Maximum Likelihood to deal with missing data, do I need to specify it in the code?

I am trying to analyse clinical + genetic data from a patient cohort as part of my PhD. I have started using LCA (Latent GOLD) to identify any underlying latent classes within my data. However, after reading the manuals and a few tutorials, I am still confused as to how to determine the best cluster model. In some places I have noticed they just opt for the lowest BIC, while in other places they select the lowest L2 value. Are there any set criteria for selecting the best model?

FIML has come to mean ML under the MAR assumption of missing data. It is true that the term is typically used with continuous outcomes, but ML under MAR can be used with categorical outcomes as well, not just continuous ones. It is the default in the current Mplus version. You obtain ML under MAR simply by specifying what missing data symbol you have in the data (Missing = in the Variable command). By requesting Patterns in the Output command you will see what missingness there is in your data.
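A minimal sketch of what that specification might look like is given below. The file name, variable names, the model, and the missing-value code -999 are hypothetical placeholders and should be replaced with whatever is in your own data.

```mplus
TITLE:    Sketch: ML under MAR with a declared missing-data code
DATA:     FILE = mydata.dat;       ! hypothetical file name
VARIABLE: NAMES = y1-y4 x1;        ! hypothetical variable names
          MISSING = ALL (-999);    ! -999 assumed to flag missing
MODEL:    f BY y1-y4;              ! hypothetical model
          f ON x1;
OUTPUT:   PATTERNS;                ! prints missing data patterns
```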

It sounds like you want to do Latent Transition Analysis. This is a Latent Class Analysis at several time points where you can study changes in class membership over time. The User's Guide has several such examples and there are several papers posted on our web site on this topic.

Hi Dr. Muthen, I ran a simple regression in Mplus and SPSS. The number of valid cases in SPSS with listwise deletion is 409, while the number of observations in Mplus is only 290, together with this warning: *** WARNING Data set contains cases with missing on x-variables. These cases were not included in the analysis. Number of cases with missing on x-variables: 190

I checked the way I read data and I did not find any problem. Looking forward to your suggestions. Thanks!

That message suggests that you did not do listwise deletion in Mplus. That message is related to TYPE=MISSING where missing data theory does not apply to independent variables. To do listwise deletion in Mplus, specify LISTWISE=ON in the DATA command.
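As a minimal sketch, the listwise request sits in the DATA command like this. The file name, variable names, and missing-value code are hypothetical.

```mplus
DATA:     FILE = mydata.dat;       ! hypothetical file name
          LISTWISE = ON;           ! request listwise deletion
VARIABLE: NAMES = y x1 x2;         ! hypothetical variable names
          MISSING = ALL (-999);    ! assumed missing-value code
MODEL:    y ON x1 x2;
```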

It turns out that the total N is 712 but Mplus only used 480 observations. The variables in the model do not have missing data and there are 712 rows in the data file. Is there any way to find out which part of the data is used in Mplus? Thanks!

Hi Dr. Muthen, I have compared the data used in the analysis with the original data. It looks like Mplus deleted some observations for reasons other than missing data (some observations without missing data were also deleted). Is there any reason why Mplus would delete observations from the analysis? Thanks!

Hi, I just upgraded Mplus to 6.1, and when I run an old program the number of subjects is now lower.

I want to estimate a regression model using FIML. But now I receive the following message:

*** WARNING
  Data set contains cases with missing on x-variables.
  These cases were not included in the analysis.
  Number of cases with missing on x-variables:  29
   1 WARNING(S) FOUND IN THE INPUT INSTRUCTIONS

In Version 5, the default changed from listwise deletion to TYPE=MISSING. You can obtain listwise deletion by adding LISTWISE=ON; to the DATA command.

Missing data theory does not apply to observed exogenous covariates. That is why observations with missing on x are excluded. If you want them included, you must mention their variances in the MODEL command. When you do that they are treated as dependent variables and distributional assumptions are made about them.
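A minimal sketch of bringing the covariates into the model might look like this. The variable names and missing-value code are hypothetical, and the WITH statement is included only to make the covariance between the covariates explicit.

```mplus
VARIABLE: NAMES = y x1 x2;         ! hypothetical variable names
          MISSING = ALL (-999);    ! assumed missing-value code
MODEL:    y ON x1 x2;
          x1 x2;                   ! mention the variances of x1, x2
          x1 WITH x2;              ! covariance between covariates
```

Keep in mind that once x1 and x2 are mentioned this way, the model makes distributional assumptions about them, so the results are no longer those of a regression conditioned on the covariates.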

You can't identify whether the data are missing as MAR or NMAR, the two key contenders. For how to approach this dilemma, see the 2 papers on missing data by Enders and Muthen et al. mentioned on our home page, which also show how to do NMAR modeling in Mplus.

Hello, I have a question regarding a discrepancy between the estimated sample statistics produced by the descriptives produced for example 11.2, and the estimated sample statistics provided for a LGM. For the outcome trajectory, the estimated means for the baseline observation are the same, but diverge for later observations. Specifically, the estimated means are substantially lower for the LGM produced SAMPLE STATISTICS: ESTIMATED SAMPLE STATISTICS at later observations than the RESULTS FOR BASIC ANALYSIS: ESTIMATED SAMPLE STATISTICS.

I am a doctoral student attempting to test second-order factors that I will include in a future SEM analysis. I am using complex survey data which require weights. My dataset does have two different versions of the weights – a base weight with a Taylor Series strata and PSU, or replicate weights. When I was proposing this project, I was advised by my mentor to use FIML in Mplus to address missing data. I have been using the replicate weights with bootstrap standard errors, but it appears that FIML is not available with this method of weighting.

Am I correct in my understanding that FIML is not available when using replicate weights with bootstrap standard errors?

In this case, what approach do you recommend? If at all possible, I would like to avoid listwise deletion.

Can FIML be used when WLSMV is specified as the estimator, and is this accomplished with the type = missing command? An earlier reply by Linda to another user's question indicated that "if you use TYPE=MISSING with WLSMV, the missing data technique is pairwise present." Thanks!

I am running a mediation (mediator is latent continuous) model with four latent factors, and predominantly categorical indicators. The outcome is ordered categorical with six levels. Would it be advisable to treat the outcome as continuous in this case (rather than specifying it as CATEGORICAL) in order to reduce the computation burden of the numerical integration that will be required for this model?

It depends on whether the ordinal variable piles up at either end, that is, has a floor or ceiling effect. If it does, it should be treated as a categorical variable. If not, you are probably safe to treat it as a continuous variable. You can also consider using the WLSMV estimator. If you have categorical factor indicators, each factor is one dimension of integration with maximum likelihood estimation.

I realize that beginning with v. 5 Mplus uses missing and Type=H1 as the default in model analyses. However, I was curious as to why the same exact model run (same command-line syntax) would indicate different missingness (on the same input data file) between versions 4.1 and 6.0. This was noted when an adviser ran the same analyses on a different version than I am using (the estimates are basically the same, but in the analyses using 4.1, all the data is indicated as being present whereas in V. 6 it indicates that there are 2 cases with data missing on x-variables and 150 cases where data is missing on all variables except x variables). I know from a previous analyses using FIML in v 4.1 that the warning for missing data is worded such that missing data are noted as 'number of cases with missing on all variables'. Is this difference b/c v 5 and higher looks at missingness as a function of x and y variables whereas v 4.1 looks at it with respect to all variables considered simultaneously?

Thanks Linda, read through this. To make sure I am understanding the difference fully would it be correct to say that pre v6.0 cases were only deleted if they were missing values for all variables (i.e., endogenous and exogenous) whereas currently cases are deleted if they are missing either all x vars, all y vars, and/or both considered in total?

Many thanks. One thing I am still having trouble wrapping my head around is that when I conduct parallel analyses in V6, my results are exactly the same as they were in v4. The only difference is that in v6 I get the warning that data set contains cases with missing on all variables except x-variables. These cases were not included in the analysis.

With those cases not included in the analysis, how is it that the model results can still be exactly the same? Does it have something to do with the fact that I have complete data for the x-variable in the model? Bear with me if the answer is straightforward and I am just not seeing it.

It is the case that for maximum likelihood estimation for continuous outcomes with no missing data, the results will be the same if the model is estimated with y and x or with y conditioned on x. It is only in this case that the results will be the same. We changed to y conditioned on x to be in line with the rest of the program and regression in general.

Are there sample size parameters for conducting a pattern mixture growth model? I am writing a data analysis section for a grant in which missing data that is not ignorable is expected. It is a small clinical trial with a total of 60 subjects with equal distribution in two treatment groups (n=30 in each). I know this is very small, but was wondering if testing a model with two growth classes (one with missing one without) would be possible?

That depends on how many time points you have and what the growth shape is, plus parameter values. You have to do a Monte Carlo simulation study to learn about it. 60 may not be too low even for a 2-class model, but with mixtures the answer also depends on the degree of class separation in the growth factor means. See UG chapter 12.

Version 6.1. See Version History on our web site to find more information about this.

You can easily revert to before v6.1 by mentioning the means or variances of the covariates. But this then makes additional model assumptions, not included in the original model. They are the same assumptions you make with multiple imputation. We made this change to be consistent throughout the program with categorical modeling, mixture modeling, and other cases.

Yes, at least the way I define FIML. FIML is helpful when some endogenous variables have some missing values because then missingness is allowed to be a function of some of the other, not missing, variables for the individuals with missing. For instance, in a longitudinal study, the outcome at the first time point may be observed for many persons and this may predict later missingness.

On p. 458 of the version 6 Mplus manual, it says, "The ASCII files...must be created by the user." I noticed yesterday, though, that I did not need to manually create these. When did Mplus start doing this automatically?

Hypothetically, in a Monte Carlo simulation study where a design cell contains 1,000 replications of a CFA model where multiple imputation was used (with 10 imputation data sets created per replication), would you simply average all of the parameter estimates and fit statistics (including chi-square) for that one cell across all imputations within replications?

You can do that for the parameter estimates but not for the fit statistics. How to accumulate fit statistic information is unstudied except for ML chi-square. See the most recent Topic 9 course handout for the formula.

Is the ML chi-square over 5 imputations, for example, output anywhere for reading into an outside statistics software package? Or will I have to calculate the ML chi-square over the number of imputations in multiple imputation?

Note that the ML chi-square for each imputation - or the average of this over replications - is not a useful measure of fit but misestimates fit quite a bit. See the Topic 9 handout of 6/1/11, slides 212-216 for a study of this. The correct chi-square T_imp is printed.

Would it be reasonable/appropriate to do a Monte Carlo simulation study to see how multiple imputation works with the WLSMV estimator? Mplus is capable of this, right? We just don't know how it will act?

I am sorry if I am not posting in the right place; I could not find an appropriate topic. I keep getting this error message when running my input file.

*** ERROR
  The length of the data field exceeds the 40-character limit for free-formatted data.
  Error at record #: 1, field #: 32
*** ERROR
  The number of observations is 0. Check your data and format statement.
  Data file: F:\MATCHEDBYCHNR W1234_mplus.csv

I have saved the SPSS file as .csv without variable names, so everything should be all right.

Furthermore, I am using type=complex and have missing data; is FIML used automatically? Also, I have indirect effects, and I have found out that bootstrap cannot be used. Is there any other option for finding out whether the indirect effects are significant? (I've heard about Prodscal but am not sure how that works.)

I am interested in using Mplus with large data sets like TIMSS and NAEP, in particular TIMSS. To do so I have a few questions:

1) Does Mplus allow for sample weights for items? If yes, how?

2) How does Mplus handle missing data in blocks? To explain in detail:

TIMSS is a collection of 12 booklets that is administered to several thousand students. Each student answers only 2 booklets; as a result, if one wants to study the whole data set, there will be blocks of missing data. I was wondering if Mplus can handle such data sets. In TIMSS 2003, 740 students respond to items in blocks 1 and 2, another 740 students respond to blocks 2 and 3, another 740 respond to blocks 3 and 4, and so on. If I stack all these items, there will be blocks of items that are missing. Can Mplus handle such data?

3) I will be using it for latent class analysis and was wondering if I could fix some of the parameters for the purpose of equating these blocks of items.

2) You have missing by design which can be handled in Mplus in two ways. One is to have as variables (columns in the data) all the variables in all 12 booklets so that all students have missing on most variables. The other is to do a multiple-group analysis where each group of students has its own set of variables (but the same number of variables).

Thank you for your reply and sorry for the double posting. Just to clarify, these variables are all categorical, not continuous: basically correct (1) or wrong (0) answers to mathematics questions. Can Mplus work with missing data when about 50% of the data is missing?

I have output for two mediation models now. However, the robust chi-square values are not computed (with MLR and TYPE=COMPLEX). Also, no standard errors are displayed, only the estimates. (I use STANDARDIZED and CINTERVAL in the OUTPUT command.)

I am receiving the error "THE MINIMUM COVARIANCE COVERAGE WAS NOT FULFILLED FOR ALL GROUPS" when running a structural equation model using complex survey data and WLSMV (n=1,100, with 46 observed variables). There are several latent variables and a number of observed dummy variables as covariates. Everything is being regressed on a dichotomous variable.

When I run the analysis variables through type=BASIC to investigate covariance coverage, it appears that all coverage values are well above 0.9. I wonder what I should be looking for. This is not a multiple group analysis so I wonder if there are other groups that the error message would be referring to? Perhaps it is referring to the clusters in the complex survey data?

Drs. Muthen: I have seen the use of latent coefficients (latent "placeholders," if you will) in longitudinal growth models where there is planned missing data.

Can latent coefficients (or latent "placeholders") also be used in the context of multi-group modeling when not all items from a standard scale were administered to one group?

What about if the item was not administered to either group? Would we be able to use latent placeholders and have Mplus estimate what the factor loadings for those items would have been had they been administered?

Or is this an inappropriate use of the latent coefficients / latent placeholders?

Do you have an example in the Mplus manual, or do you know of an article, that has used latent placeholders in the context of multigroup modeling with planned missing data in the past?

If a study has planned missingness, a missing value flag should be assigned to individuals who did not take the item. Nothing more needs to be done. If no one took an item, it will not be used in the analysis. An item with all missing contributes nothing to the model.

You cannot identify or estimate a loading for an item that was not administered to anyone in the sample because there is no sample information on this loading. You have to have some subjects who have responses on the item so that you know how this item correlates with the other items - and therefore can draw inference about how subjects who didn't take the item might have scored.

In a two-group model you can have an item that is only administered in one of the two groups, but you cannot estimate a group-specific loading for that item in the group that didn't take the item. This is for the same reason as above.

I have not heard of latent coefficients / latent placeholders, so I don't know what that is.

Bengt (or Linda): You mentioned that "in a two-group model you can have an item that is only administered in one of the two groups."

This is what I am doing, and the items are categorical, in a 2-group (gender) CFA model. One such item that was administered to only one group (girls only) is CBCEYEP (measuring eye problems as a somatic symptom).

When I run the analysis, I get the message: *** ERROR Categorical variable CBCEYEP contains less than 2 categories.

However, that is not true. Among girls, to whom the item was administered, all three possible categories were endorsed. I set loadings and thresholds to be equal in both groups, and freed the scaling factor in one group.

We have extra variables in certain groups (and missing in the other group), and these variables are dependent. Fixing the residual variances to be equal to those in the other group implies Theta parameterization, does it not?

The trick in that FAQ is only for continuous outcomes, not categorical outcomes that you have. Here is another approach you can take.

Assume as an example that you have 10 items that both males and females respond to and assume that each gender also responds to 5 additional items, but they aren't the same for the two genders. So each gender responds to 15 items. Your input should then refer to 15 items in the USEV list and your model should have any equality constraints applied only to the same 10 items that both genders respond to.

You don't want to list 20 items because both groups would then have 5 items to which nobody in the group has responded.

The 15 items are different items for the two groups. For males, it is the set of 15 items that males responded to, for females it is the set of items that females responded to. So you have to arrange your data that way.

Bengt, I am trying to run this as a multi-group model, where parameters for males and estimates for females are estimated simultaneously.

Do I estimate this in three stages? For example, get the estimates for the model with items that females responded to, then get the estimates for the model with items that males responded to, then run the overall model with parameters set at the values derived from the first two models for the times that these are missing for a certain gender in the overall model?

I am sorry. I am so confused about this. I think the three-stage strategy may work because Linda mentioned that groups with no data for an item do not contribute to the estimates? However, if I have for example a factor with 8 indicators but two are missing for boys and two are missing for girls, and I try this three-stage strategy, can I assume that including the loadings for these items administered only to one of the two groups would have no effect on the other loadings when I add them in?

What I suggested is a single analysis, not a multi-stage analysis. I suggested a simultaneous, 2-group analysis of males and females. You arrange your data so that say the first 10 columns are the common items and the next 5 columns are the items specific to each gender (so those 5 are different items for the two genders). So for instance if you have one factor f, you say in the Overall part of the model:

f by y1-y15;

You can then apply measurement invariance across gender for the first 10 items. The next 5 items contribute to the measurement of f, although they are different items for the two genders.

This is a standard type of approach when different sets of subjects take different forms of an achievement test. A similar approach is also used with multiple-cohort data.
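A rough sketch of the two-group setup described above is shown here, under several assumptions that are not in the original post: the items are binary, the default weighted least squares estimator with the Delta parameterization is used, and the variable names, group codes, and missing-value code are hypothetical. Columns y1-y10 hold the common items and y11-y15 hold the gender-specific items.

```mplus
VARIABLE: NAMES = gender y1-y15;   ! hypothetical variable names
          USEVARIABLES = y1-y15;
          CATEGORICAL = y1-y15;    ! binary items assumed
          GROUPING = gender (1 = male 2 = female);
          MISSING = ALL (-999);    ! assumed missing-value code
MODEL:    f BY y1-y15;
MODEL female:
          f BY y11-y15;            ! free loadings, specific items
          [y11$1 y12$1 y13$1 y14$1 y15$1];  ! free their thresholds
          {y11-y15@1};             ! scale factors fixed at one
```

The group-specific statements are there because loadings and thresholds of categorical indicators are held equal across groups by default, which is wanted for y1-y10 but not for y11-y15, since those columns hold different items in the two groups.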

If you are still unsure of what I am suggesting, you may want to consult with an SEM person on your campus who can sit down with you and talk you through it.

Hello, what does the following message imply about my data, and how can I fix the problem so that the model will run? I don't think one entire group is missing data for these items, so I am not sure why I am getting these messages.

WARNING: THE BIVARIATE TABLE OF VANDA_D AND SKINP_D HAS AN EMPTY CELL.

COMPUTATIONAL PROBLEMS ESTIMATING THE CORRELATION FOR VANDA_D AND SKINP_D.

P.S. I do have several items with low endorsement, such as the items below, which were mentioned in the warning above. Can items with low endorsement lead to the generation of a warning like the messages above? The data are not really missing, so it's a confusing message to receive.

When a bivariate table has an empty cell, this implies a correlation of one which means that only one of the variables should be used in the analysis. Variables that correlate one are not statistically distinguishable. Empty cells can occur for extreme items when sample size is small.

I am trying to run what I thought was a very simple model using version 6.1, predicting wave 2 self esteem (continuous) from sex (categorical), wave 1 self esteem (continuous), and authoritative parenting (continuous). The problem is that I'm getting listwise deletion of all cases with missing on x-variables! Here are the highlights:

This is because missing data theory does not apply to observed exogenous variables. To avoid this, you would need to bring all of the covariates into the model by mentioning their variances in the MODEL command. When you do this, they are treated as dependent variables and distributional assumptions are made about them.

Data set of N=800, 5 measurement points (MP); the first MP has 5% and the last MP 40% missing values on my one continuous outcome variable. Covariates (6 time invariant, 1 time varying) also have some missing values. If I run the whole growth mixture model, Mplus deletes about 50% of my subjects, which is an insane amount of information I do not want to lose:

"Data set contains cases with missing on x-variables. These cases were not included in the analysis. Number of cases with missing on x-variables: 396"

What to do?

(1) Auxiliary isn't possible in type RANDOM or MISSING if I see that correctly.

(2) I watched your videos 3-6, but the parts about missing data confused me more than they helped me ;). I read up on "Diggle Kenward selection Modeling" and "Roy's Model (Pattern Mixture Modeling)" but I don't want to write a paper on missing data and imputation. Other people must struggle with this also. Are there any guidelines I can follow on this?

(3) Chapter 11 of your wonderful manual: "Covariate missingness can be modeled if the covariates are brought into the model and distributional assumptions such as normality are made about them." What does this mean - how do I model covariate missingness exactly?

I would choose number 3. What this means is that in regression the model is estimated conditioned on the covariates and no distributional assumptions are made about them. If you bring them into the model and treat them as dependent variables, distributional assumptions are made about them.

I am trying to run a mixed-effects meta-analysis using SEM, with 4 dummy coded moderator variables as fixed effects and a random effect for the intercept. Several studies are missing data on one or more moderator variables.

When I run the model with TYPE=RANDOM, I get a warning indicating that listwise deletion was done and only 11 of my 22 cases were included in the model. However, when I run the model with both the intercept and moderators as fixed, I do not get this warning and all 22 cases are included.

Why is this and what do I need to do to use FIML for the mixed-effects analysis?

I am using Mplus Version 6.1 and am using ML w/monte carlo integration, and have been unable to get fit statistics. I thought I read that beginning w/version 3, this would be possible? Do you have any suggestions?

Hi. I’m trying to unpack the defaults in Mplus (5.21) Re: the way it “adjusts” for observed control covariates, across different estimators. I have a model: Latent Y regressed on latent X1 and a set of observed control covariates. I use the MLR estimator. It looks like the default is to give estimates of the covariances between the observed and latent covariates. My understanding is that one doesn’t need to “call in” the covariances amongst the observed covariates, in order to make sure that the fitted regression parameters control for the other variables in the model. If I, say, use numerical integration here instead, it looks like the default is to NOT estimate covariances between the observed and latent covariates. Is it still the case that the regression parameters are adjusted for the effects of the covariates in the model (observed or latent)? I ask b/c I fit the same model –w/ MLR and then w/numerical integration—I get notably different estimates for the effects of my observed covariates. W/ MLR, it looks like it might be adjusting for the other covariates, whereas w/ numerical integration it does not appear to be. If I “call in” the covariances between the latent and observed covariates w/ numerical integration, it looks like the MLR model w/out numerical integration. I suppose it could also be a difference in the way the two models handle missing data (?). Thanks for any thoughts.

In one of your old posts (above), you mentioned "For censored and categorical outcomes using weighted least squares estimation, missingness is allowed to be a function of the observed covariates but not the observed outcomes." I was just wondering if you have a reference paper for this so that I can read more details. Thanks a lot.

I am new to Mplus and have version 6.12. I am running a simple CFA with one factor and 21 ordinal outcome variables. But I have missing data (assuming MAR). I am a little confused, as I have read different things about whether the program 'handles' the missing data when missing is indicated. I have gotten the program to run and get fit indices, but am worried that the estimator being used (WLSMV) isn't appropriate. Is this OK?

I am having trouble reproducing residual variances from the estimated parameters of a path analysis model with missing data.

I ran path models with and without missing values. When I manually recalculated the residual variance using the estimated parameters, I could only reproduce the Mplus residual variances when I had full data. I took out the cases with missing values and recalculated again, but my calculation still did not match the Mplus estimate of the residual variance.

Could you please help me understand what is the problem. I built the model as single indicator factor analysis model. Thank you.

I don't see how you can manually calculate the residual variance - it is estimated by ML I assume? You don't say if your variables are continuous or categorical, where in the latter case residual variances are not free parameters.

I just recalculated it as the variance of the difference between the estimated dependent variable Y' and the actual Y. Y' was calculated as Y'=intercept+coef*X. X and Y are observed, so I used the parameters (coef and intercept) produced by Mplus to reproduce the residual variance in a spreadsheet. Please let me know if I have misunderstood. I was able to reproduce the residual variance for full data but not for missing data.

You are assuming that estimated variance of Y equals the sample variance of Y. This is not true for all models. If you look at your missing data run, requesting Residual in the Output command, you will probably see that the difference between estimated and observed variance is not zero.
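
The point can be checked with a small Python sketch (made-up numbers, not Mplus output). The identity "residual variance = var(Y) - slope^2 * var(X)" holds when the intercept and slope are the least-squares values for exactly the cases at hand. The second pair of coefficients below is hypothetical and stands in for estimates that were obtained from a different set of cases, as can happen under ML with missing data; plugging them into the same raw data gives a different number.

```python
from statistics import fmean


def ols(x, y):
    """Least-squares intercept and slope of y on x."""
    mx, my = fmean(x), fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def residual_variance(x, y, intercept, slope):
    """Mean squared residual of y around intercept + slope * x (divides by n)."""
    return fmean((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))


def variance_n(values):
    """Variance with divisor n, matching the ML convention."""
    m = fmean(values)
    return fmean((v - m) ** 2 for v in values)


# Illustrative complete data (hypothetical values)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a_ols, b_ols = ols(x, y)
rv_ols = residual_variance(x, y, a_ols, b_ols)
rv_identity = variance_n(y) - b_ols ** 2 * variance_n(x)

# Hypothetical coefficients that were NOT estimated from these exact cases
a_other, b_other = 0.30, 1.90
rv_other = residual_variance(x, y, a_other, b_other)

print(rv_ols, rv_identity, rv_other)
```

The first two printed values agree; the third does not, which is the kind of mismatch described above.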

I am fitting a latent variable model that also includes random effects. Some of the items are binary while the rest are continuous. My data includes missing values as well. The output that I get says that the dimension of numerical integration is 1 and it is actually very fast. I was wondering how the fitting takes place with only one dimension of numerical integration although I have seven latent variables and random effects? Is there a dimension reduction technique used by Mplus?

I have 4 u's (random effects) and 3 z's (latent variables) and the output below specifies that the dimension of numerical integration is only 3. I am just wondering why the dimension is only 3 and how the other latent variables are integrated.

Another question I have is about dealing with survival data. Do we have to specify that some indicators are survival indicators? Or is it sufficient to have the survival items set up as in (Muthen and Masyn, 2005) and then Mplus will automatically model the conditional probabilities of survival (hazard function) rather than just a logistic model?

Hello, I would like to determine the proportion of missing data in my data set. Under the DATA MISSING command, would I just ask for DESCRIPTIVES for the variables that I named as missing? Are frequencies provided as part of the DESCRIPTIVES output?

I am conducting an SEM analysis using data from a cross-sectional study. Missing data analysis for some variables showed that the data are missing not at random (MNAR). I am wondering whether MLR will be able to handle data which are missing not at random? If not, will it help if I impute missing values using SPSS for these variables before I conduct the analysis using Mplus?

You want to make a distinction between MCAR, MAR, and NMAR (=MNAR). See missing data books, or our Topic 4 teaching. Mplus does MAR by default (often called FIML) and can also do NMAR modeling. Mplus also does multiple imputation. Multiple imputation assumes MAR.

What you are probably seeing is that MCAR does not hold. MAR may still hold. There is no way of knowing if MAR or NMAR holds.
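
A simulated illustration of "MCAR does not hold, MAR may still hold" (a sketch with made-up data; the seed, sample size, and missingness rule are all assumptions chosen for the demonstration). Here y is missing whenever the fully observed x is positive, so missingness depends only on an observed variable. The complete-case mean of y is badly off, while an estimate that uses x for everyone recovers the mean.

```python
import random
from statistics import fmean

random.seed(12345)  # arbitrary seed, for reproducibility only
n = 20000

# x is fully observed; y depends on x
x = [random.gauss(0, 1) for _ in range(n)]
y = [xi + random.gauss(0, 1) for xi in x]

# MAR mechanism: y is unobserved whenever x > 0 (depends only on observed x)
observed = [xi <= 0 for xi in x]
xc = [xi for xi, o in zip(x, observed) if o]
yc = [yi for yi, o in zip(y, observed) if o]

# Listwise / complete-case estimate of the mean of y
complete_case_mean = fmean(yc)

# Regression of y on x among complete cases, then average the
# fitted line over ALL x values (uses the information in x for everyone)
mx, my = fmean(xc), fmean(yc)
slope = sum((a - mx) * (b - my) for a, b in zip(xc, yc)) / sum((a - mx) ** 2 for a in xc)
intercept = my - slope * mx
mar_mean = intercept + slope * fmean(x)

print("true mean of y (all cases):", fmean(y))
print("complete-case mean        :", complete_case_mean)
print("estimate using x for all  :", mar_mean)
```

Note that nothing in the observed data alone could distinguish this situation from one where missingness also depends on the unseen y values, which is why MAR versus NMAR cannot be verified.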

Hi there, I'm doing a latent class growth analysis across 5 time points (baseline, 3, 6, 9, and 12 months) with missing data. In reading various literature, I understand I should use the "auxiliary" function to make the MAR assumption more plausible for my data. So, I've pasted my syntax below to ensure it is correct because my output doesn't seem to change with or without the auxiliary function (note, the sample size is 269 patients).

In response to one of your previous comments "FIML is an estimator and EM is one algorithm for computing FIML estimates. Other algorithms include Quasi-Newton, Fisher Scoring, and Newton-Raphson. Mplus uses the EM algorithm for the unrestricted H1 models and the other algorithms for H0 models. "

Aren't Quasi-Newton, Fisher Scoring and Newton-Raphson mathematical methods for finding a solution of an equation? How are they related to missing data? If these algorithms are used for H0 models, is it true that the missing data were not taken into account for H0 models? Thanks.

I think I was trying to make a distinction between estimators and algorithms because it seems like sometimes missing data handling is referred to as using the "EM approach", which mixes apples and oranges. So the answer is Yes to your first question. Any of the algorithms can be used to do ML under MAR, which is often called FIML. So the answer to your second question is No - the use of these algorithms is unrelated to whether missing data is handled or not. You have to look to the assumptions made in the estimation to know how missingness is handled.
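
To illustrate the estimator-versus-algorithm distinction, here is a small sketch (toy data, not Mplus code). For a bivariate normal with x complete and y partly missing, the ML estimates under MAR have a closed form via the factored likelihood; the EM algorithm is just another route to the same maximum. The data values and function names are made up for the illustration.

```python
from statistics import fmean

# Toy data: x complete, y missing for the last three cases
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, None, None, None]


def closed_form(x, y):
    """ML under MAR via the factored likelihood: p(x) * p(y | x)."""
    pairs = [(a, b) for a, b in zip(x, y) if b is not None]
    mxc = fmean(a for a, _ in pairs)
    myc = fmean(b for _, b in pairs)
    slope = sum((a - mxc) * (b - myc) for a, b in pairs) / sum((a - mxc) ** 2 for a, _ in pairs)
    intercept = myc - slope * mxc
    resid_var = fmean((b - intercept - slope * a) ** 2 for a, b in pairs)
    mu_x = fmean(x)
    var_x = fmean((a - mu_x) ** 2 for a in x)
    return {"mu_y": intercept + slope * mu_x,
            "var_y": resid_var + slope ** 2 * var_x,
            "cov": slope * var_x}


def em(x, y, iterations=500):
    """EM for the same model: same estimator, different algorithm."""
    mu_x = fmean(x)
    var_x = fmean((a - mu_x) ** 2 for a in x)
    obs = [b for b in y if b is not None]
    mu_y = fmean(obs)
    var_y = fmean((b - mu_y) ** 2 for b in obs)
    cov = 0.0
    for _ in range(iterations):
        slope = cov / var_x
        cond_var = var_y - cov * cov / var_x
        ey, eyy, exy = [], [], []
        for a, b in zip(x, y):
            if b is None:
                # E-step: expected sufficient statistics given x
                m = mu_y + slope * (a - mu_x)
                ey.append(m)
                eyy.append(m * m + cond_var)
                exy.append(a * m)
            else:
                ey.append(b)
                eyy.append(b * b)
                exy.append(a * b)
        # M-step: update moments from the completed statistics
        mu_y = fmean(ey)
        var_y = fmean(eyy) - mu_y ** 2
        cov = fmean(exy) - mu_x * mu_y
    return {"mu_y": mu_y, "var_y": var_y, "cov": cov}


cf = closed_form(x, y)
em_est = em(x, y)
print(cf)
print(em_est)
```

Both routes use every case, including those with y missing; which algorithm does the maximizing says nothing about how missingness is handled.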

In response to one of your old comments "There is no paper that describes this. With WLSMV the dependent variables are looked at in pairs so missing data information cannot be gathered from all variables like in maximum likelihood."

So pairwise likelihood uses information from each pair of the observed endogenous variables. But for those who have only one observed endogenous variable, there are no pairs available. Will those subjects be thrown away when using pairwise likelihood? Thanks.

Auxiliary variables are treated as continuous and should not be specified to be other than that. Using a nominal variable as an auxiliary variable would not work. You may want to create a set of dummy variables.

I am having some trouble with coding in upgraded Mplus. My old version of Mplus was Version 4. If I ran a continuous growth curve in 4 I had to specify the TYPE=Missing option, but once I did that it would handle both missingness on my observed Y's (from attrition) and missingness on my X variables. Now in the new version (6) TYPE=Missing is the default but my model is "kicking out" anyone with missingness on any X variable. It only used to do that when I modeled a noncontinuous Y variable. Is this some problem in my code or did the default FIML change such that it automatically drops cases with missing on the X variables?

I am trying to run through the steps of factorial invariance and ultimately run from these latent constructs a growth curve over 4 time points (continuous data).

I have imputed my missing data (resulting in 20 data sets) using the latest version of Amelia and created the list.dat file (as is done in example 11.5 of the user guide). However, while I no longer have missing data, I do have some variables (mean scores) that will be ultimately included in my model that are non-normal (not seriously however).

I would like to use an appropriate estimator that will allow me to use the TYPE = imputation command to summarize results and provide me with the information I require to examine the CFI and RMSEA CI to judge my model as I go through the steps of factorial invariance.

From what I can tell with my first attempt, using the MLR estimator (not MLM, as it seems it's not available with TYPE=imputation), I cannot access the CI for the RMSEA (as it is provided if using the ML estimator with TYPE = imputation).

My question is how robust the ML estimator is with nonnormal data and if it would be appropriate to use this (knowing I have some nonnormality) so that I can get the fit information I require using the TYPE= imputation command.

The only fit statistic that has been developed for multiple imputation is chi-square for ML. For the others, averages are given. I would run ML and MLR and see how different the standard errors are. If they are not that different, it would indicate that your variables are not that non-normal. I would then use ML.
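
For parameter estimates, as opposed to fit statistics, results across imputed data sets are conventionally combined with Rubin's rules: average the estimates, and build the standard error from the average within-imputation variance plus the between-imputation variance. The sketch below is a generic illustration with made-up numbers; the function name is hypothetical and this is not a description of Mplus internals.

```python
from math import sqrt
from statistics import fmean, variance


def pool_rubin(estimates, standard_errors):
    """Pool one parameter across m imputed data sets using Rubin's rules."""
    m = len(estimates)
    pooled = fmean(estimates)
    within = fmean(se ** 2 for se in standard_errors)  # average sampling variance
    between = variance(estimates)                      # variance across imputations (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, sqrt(total)


# Hypothetical results for one coefficient from 5 imputed data sets
estimates = [0.50, 0.54, 0.46, 0.52, 0.48]
standard_errors = [0.10, 0.10, 0.10, 0.10, 0.10]

pooled_estimate, pooled_se = pool_rubin(estimates, standard_errors)
print(pooled_estimate, pooled_se)
```

The pooled standard error is larger than any single-data-set standard error because it also reflects the uncertainty due to the missing values.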

I have 13 groups, and am testing a five-item CFA (but there are 150 items in the data set). The sample sizes shown in the "Number of observations" section of the results are 20 to 30% smaller than the real sample sizes. The only explanation that I can think of is that listwise deletion has been applied to all items mentioned in the "names are" part. Can you think of any other explanation? If not, I will talk to my supervisor about sending the files. Thanks.

Look at the warning messages that are printed for possible reasons. Check that you are reading the data correctly. You may have blanks in the data set that are not allowed with free format. Check that the number of variable names in the NAMES list is the same as the number of columns in the data set.

I run a TWOLEVEL model. N is 72418. Mplus 6.12 drops 42258 cases due to missingness on the x-variables. I have used SPSS to check the total number of cases with at least one missing value and it is about half as large: 20765.

If I run a model with no predictor (just the dependent variable), there is no difference between the number of dropped cases reported by Mplus and the one that I compute in SPSS. The more predictors I add, the higher the loss of cases when using Mplus (as compared to the value that I compute in SPSS).

Since I am not very experienced with Mplus, it is probably something that I am missing, but I have no clue what it could be. Any suggestion would be more than welcome!

I have data that is neither MAR nor missing by design - we are measuring sexual dysfunctions as part of our analysis, and people that did not engage in sexual activity were unable to answer many of the questions, so have system missing values. We are using mixture models to analyse the relationships between sexual dysfunctions, depression and anxiety disorders.

I was wondering whether using FIML would be an acceptable way to deal with these cases, or if you have any other advice?

I am preparing data for an MSEM path model that will be examined in Mplus for my dissertation. The extent of missing data is less than 5%. I am considering two options for handling the missing data.

1) Impute missing data using EM before running the model. This would allow me to retain available data for summary scores.

2) Run the model in Mplus using FIML estimation of missing data. This is a more accurate estimation but would be based on less information, because the summary scores would be missing for any case missing data on any item in the measure. The extent of missing data would also be greater because of this handling of the missing data.

What might be the advantages and disadvantages of using EM vs. FIML in this instance?

Imputation and FIML should give quite similar results. Typically, if you can do FIML that gives you more options for various tests. Note that Mplus does imputation. I don't see how there would be a greater extent of missing data with FIML than imputation.

Regarding the summary scores, why not use the average value of the items that are not missing. The imputed values don't carry new information anyway.
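
A minimal sketch of that suggestion in Python: score each person as the mean of whichever items they answered. The function name and the `min_items` threshold are assumptions added for illustration; the post above does not specify a minimum number of answered items.

```python
from statistics import fmean


def scale_score(items, min_items=1):
    """Mean of the non-missing items; None if too few items were answered.

    `items` uses None for a missing response. `min_items` is a hypothetical
    safeguard, not something prescribed in the thread.
    """
    present = [v for v in items if v is not None]
    if len(present) < min_items:
        return None
    return fmean(present)


# Hypothetical responses to a four-item scale
print(scale_score([4, None, 2, 3]))        # mean of the three answered items
print(scale_score([None, None, None, 5]))  # only one item answered
print(scale_score([None, None, None, None]))
```

A case then only has a missing summary score when every item (or more than the chosen threshold) is missing, which keeps the amount of missingness at the scale level low.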

I'm working on SEM with categorical variables. In the output of my results, I have this warning: Data set contains cases with missing on x-variables. These cases were not included in the analysis. Number of cases with missing on x-variables: 105

The number of observations is 2388.

Which method can you suggest for dealing with the missing data: listwise deletion or pairwise deletion? Some authors propose maximum likelihood estimation for incomplete data. But this option doesn't work in my case because my analysis is type=complex.

"If a study has planned missingness, a missing value flag should be assigned to individuals who did not take the item. Nothing more needs to be done."

I am working on data with skip patterns such that there are some variables that are responded by only a subset of participants. I have two questions:

1) When using multiple imputation, can I impute such that only those who should have responded but refused are imputed? I don't want values for those who are valid skips. If I set valid skips to missing, they will be imputed. If I do not set to missing but assign them values, the program will use those values as valid responses while imputing missing data. What should I do?

2) The model I am working on includes these variables that are responded by only a sub-sample. When modeling these variables, is there anything that needs to be done? Or do I just use them just like any other variable in the model?

If the missing data are not specified in the command, how would they be treated by Mplus? In my data set, my missing data were blank; I forgot to specify missing is blank in the first place, but still got an output with no error message. I'm curious how those missing data were treated.

Hi there - I am conducting a longitudinal CFA with one factor, four non-normal continuous indicators and two time points, pre-intervention and post-intervention. I have about 650 cases. I am conducting tests of measurement invariance (partial strict invariance is supported) and the aim of the study is to determine the effect of the intervention on the latent mean (so the reduction in the latent mean from pre to post, which is significant). I am able to run the analyses no problem, but the problem is that about 50% of the post-intervention data is missing from drop out. I would rather use MLR and all cases rather than perform listwise deletion, but I'm not sure what the impact of such a large amount of missing data would have. Any help would be really appreciated.

The Mplus default is MAR using all cases, which is obtained when requesting either ML or MLR. But you are right that 50% attrition is a lot and that means that the results depend to an uncomfortably large extent on the model assumptions, including normality. It can be particularly problematic if the missingness rate is different for the intervention groups. I assume it is not possible to try to find a random sample of those who were lost to follow-up.

The intervention is done online in an open access research setting, so we have no control group just the intervention and no contact with the patient once they drop out. The indicator variables are very much left censored.

Because your mediator is categorical you have to pay special attention to how to treat the mediator in the modeling. Call it u and let u* be an underlying continuous latent response variable for u. The key question is if u or u* is the predictor (IV) for the distal outcome y. ML uses u which complicates matters. WLSMV uses u*. Bayes can use either. More correct causal effects are obtained as in the paper on our website:

A simple approach is to use WLSMV and include the x variables in the model by mentioning their means or variances. This also enables Model Indirect. You can also do Multiple Imputations as a first step to handle the missingness on the x's.

This means that income has a variance so large that it will not fit in the space allocated. We recommend keeping the variances of continuous variables between one and ten. You can rescale variables using the DEFINE command by dividing them by a constant so that their variances are between one and ten, for example,
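
The arithmetic behind the rescaling can be checked with a quick Python sketch: dividing a variable by a constant c divides its variance by c squared. The income values and the divisor below are hypothetical, chosen only so that the rescaled variance lands in the recommended range.

```python
from statistics import pvariance

# Hypothetical income values in dollars
income = [21000.0, 35000.0, 48000.0, 52000.0, 75000.0, 110000.0]

divisor = 10000.0  # assumed constant; pick it so the variance ends up between 1 and 10
income_rescaled = [v / divisor for v in income]

var_before = pvariance(income)
var_after = pvariance(income_rescaled)

print("variance before:", var_before)
print("variance after :", var_after)
print("ratio          :", var_before / var_after)  # equals divisor squared
```

Regression coefficients involving the rescaled variable change by the same constant, so results can be converted back to the original metric afterwards.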

I'm trying to calculate the percentage of missing data in my data set. If I specify "missing = all(-999)", for example, is there a command I can use to determine the frequency with which the value "-999" is observed? Thanks.

Dear Prof. Muthen, I am running an analysis with MLR. If I understand correctly, the default is that Mplus deletes cases with missing values on exogenous observed variables and uses full information for missing values on the endogenous variables? What are the advantages of using this approach? Do you have a reference to read more about this? Thank you very much in advance.

Hi, I wanted to test the missing pattern of my dataset based on Little’s MCAR test (Schlomer, Bauman, & Card, 2010) using Mplus. I checked out posts on the Mplus forum and it looks like we have to use "type=mixture" to obtain Little's MCAR test and the variables have to be categorical. However, I doubt I understand it well. There has to be a general command for testing MCAR and MNAR for imputation in a general regression model (not a complex mixture model). May I have your advice on how to conduct this MCAR test with a chi-square value in Mplus? What is the command syntax for this test with continuous variables and simple regression models? Thank you very much!

In particular, MAR vs. NMAR testing is conditional on assumptions about the missing data mechanism. So I would say it is somewhat limited (that has nothing to do with which software you use - it comes from the fact that the MAR hypothesis is very general - it is hard to test against any ignorable missing data mechanism).

Dear MPlus developers, I'm trying to understand the exact algorithm that you use for dealing with missing data through maximum likelihood. Reading classical papers on the topic, I thought that there exists a closed form for the maximization of the full information maximum likelihood problem with missing data only when the outcomes can be considered multivariate normal, while in all the other cases, so for example with categorical outcomes, we need iterative methods like the EM algorithm. Is this what MPlus does? Or am I wrong? Sorry for bothering you, but I didn't find this information anywhere, Thanks in advance.

Dear Tihomir, thank you very much for your answer! I had already seen Appendix 6, but I didn't find what I was looking for and also I thought it was a little bit out of date, since it starts by saying "Missing Data is allowed for in cases where all y variables are continuous and normally distributed", while I read in the general description of modelling missing data that "MPlus provides ML estimation under MCAR (missing completely at random), MAR (missing at random), and NMAR (not missing at random) for continuous, censored, binary, ordered categorical (ordinal), unordered categorical (nominal), counts, or combinations of these variable types (Little & Rubin, 2002)." That is also what I'm mostly interested in. I will give a look to the second reference you gave me, Thank you very much again for your answer!

In SEM with the criterion variable having categorical indicators with missing values (the predictor variables have continuous indicators with missing values), can I use FIML? Thank you very much in advance!!

Hi, I’m running a model with ML estimation. By default Mplus deletes cases with missing values on exogenous variables and uses full information for missing values on the endogenous variables. The result is, however, that I’m still missing 33 cases due to missings on five exogenous variables. On this forum, I read that Mplus can handle missings on x-variables if they are brought into the model as y-variables, for instance, by mentioning the variance of the variable in the model command. But I’m not sure what the effect of this is and what exactly I’m doing. Isn’t it a bit artificial? So I’m wondering what the normal/standard procedure is for handling missings. Do you follow the default and accept that you have 33 missings due to missings on x-variables? Or do you do some tricks/adjustments to make your x-variables look like y-variables so that no cases are deleted? If the latter option is the standard/best approach, could you please tell me precisely what I should add to my model command, given that I have five exogenous variables of which three are dichotomous and two are continuous? Is it just as simple as adding “X1 X2 X3 X4 X5;” into the syntax? Thank you very much in advance.

(1) Usually in multiple imputation normality is assumed for the variables with missing. This approach is also taken if you bring the x variables into the model. (2) Or, you can use multiple imputation and specify that a variable is say categorical and then an underlying continuous-normal latent response variable is assumed. Both approach (1) and (2) therefore make assumptions. Treating your binary x's as continuous-normal in approach (1) is only an approximation and taking approach (2) may also be only an approximation. So in both cases you go beyond the assumption of the model you originally specified for the y's as a function of the x's. Approach (1) is probably very often taken. You mention 33 missings, but more relevant is probably the percentage that this corresponds to - if it is small the analysis doesn't rely on assumptions as much as if it is large.

I am analyzing some longitudinal data in a cross-lag model (N = 250) and have 30% of subjects missing data at T1, 26% at T2 and 38% at T3. Relatedly, 36% have data at all three time points, 34% at 2 of 3, and 30% at only 1 of 3.

Based on examining correlations within the sample, these appear to be MAR and we have included correlated variables in the analysis as auxiliary variables to help with estimation and reduce bias.

We submitted the paper and both the handling editor and 1 of the reviewers expressed concern regarding the % missing. Do you know of any references that give guidelines on acceptable levels of missingness? Or do you have a personal rule of thumb? We have cited McArdle et al 2004 & Enders & Bandalos, 2001 on the use of FIML to address missingness. And Enders 2010 on the use of auxiliary variables. Any other suggestions? thank you very much for your ideas/suggestions.

I would be most concerned about how many observations are present at two of the three time points. I would also compare my results to those of listwise deletion. You can also create dummy variables for missing for times two and three and regress them on the time 1 outcome to see if y1 predicts missingness.
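
A rough sketch of that check in Python, with made-up data. It builds a 0/1 missingness dummy for time 2 and regresses it on the time 1 outcome using a simple least-squares slope (a linear probability stand-in chosen for brevity; a logistic or probit regression would be the more usual choice). Variable names and values are hypothetical.

```python
from statistics import fmean

# Hypothetical outcomes; None marks a missing time 2 score
y1 = [3.0, 5.0, 4.0, 8.0, 7.0, 9.0, 2.0, 6.0]
y2 = [3.5, 5.2, None, None, 7.1, None, 2.4, 6.3]

# Dummy variable: 1 if the time 2 outcome is missing, 0 otherwise
miss2 = [1 if v is None else 0 for v in y2]


def ols_slope(x, d):
    """Least-squares slope of d on x."""
    mx, md = fmean(x), fmean(d)
    sxy = sum((xi - mx) * (di - md) for xi, di in zip(x, d))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


slope = ols_slope(y1, miss2)
mean_y1_missing = fmean(v for v, m in zip(y1, miss2) if m == 1)
mean_y1_observed = fmean(v for v, m in zip(y1, miss2) if m == 0)

print("slope of missingness on y1:", slope)
print("mean y1, missing at time 2 :", mean_y1_missing)
print("mean y1, observed at time 2:", mean_y1_observed)
```

A clearly nonzero slope says that missingness is related to y1, which rules out MCAR but is compatible with MAR as long as y1 is part of the model. The same steps would be repeated for the time 3 dummy.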

I don't know of any discussion of how much missingness is too much. The Enders book is the most likely source.

We are conducting a randomized control trial and we are doing multilevel modeling to determine program effect. According to the What Works Clearinghouse (WWC), when dealing with missing data in our analyses we need to do so separately for the treatment and control groups. Can we use FIML separately for the treatment and control groups?

The other alternative that WWC accepts is multiple imputation but this also needs to be done separately for treatment and control groups. Is there a way to do this within Mplus?

Dr. Muthen, I am running a 3-step GMM model with 5-wave scale scores as my outcome variables and a few independent variables to predict class membership (e.g., race and LGS scores). Three participants did not indicate their race and one has a missing LGS score. To keep them in the step 3 analysis, here is the model syntax: Model: %OVERALL% i s | pl_1@0 pl_2@1 pl_3@2 pl_4@3 pl_5@4; i WITH s; c on AA sex LGS z_t1_c z_t1_n z_t1_e highschool somecoll college; [AA LGS]; ... The analysis did include all 163 participants. However, the AIC, BIC, and ABIC are much larger than in the model without estimating AA and LGS. I wonder what your advice would be given this situation. Thank you.

I'm running a latent growth model in which the latent intrinsic work rewards variable at each of the six waves was specified by four items, and then intercept and slope were estimated using the six latent variables. Finally, I used the intercept and slope to predict generativity at the final wave along with some control variables.

My concern is that there are two types of missing data here: those who did not participate in a given wave (missing at random) and those who participated but did not answer the intrinsic work rewards questions because they were unemployed at the time (missing not at random). I'm fine with having FIML estimate for those who missed the wave, but not comfortable estimating for those who participated in the wave but were unemployed. Would it be possible for you to give me some suggestions on how to restructure my model to account for this missing not at random? Would it be appropriate to add six dummy variables, one for each wave's employment status, when predicting generativity to address the concern of unemployment? Or should I revise in some way the longitudinal CFA model in the earlier step? Thank you very much.

It's a good research question that I don't think I know the answer to. I wonder what it would be like if you use a parallel process growth model where you have one binary part of employed/unemployed and one continuous part with an intrinsic work reward score. The latter is missing when the former is in the unemployed status. Which would mean missing as a function of an observed variable so could be MAR. At least the missing would not only be a function of other intrinsic work reward scores, but directly a function of employment.

I am planning on running path analysis models involving examination of direct and indirect effects on a sample (N = 159) with data at 3 different time points. My variables of interest are scale scores (means of multiple items from questionnaires). The sample has both item-level missingness (1 or more items of a scale missing) and scale-level missingness (entire scale missing) for both predictor and outcome variables and I am seeking advice on the best approach to deal with missingness. I had the following questions:

1) Should item-level missingness be dealt with using multiple imputation as a first step in software outside of Mplus? And should this then be followed by maximum likelihood estimation at the scale-level in Mplus? Or:

2) Should item-level and scale-level missingness be dealt with using multiple imputation outside of Mplus and the imputed "complete" dataset be used for subsequent analyses?

The optimal approach would seem to be to formulate a factor model for the item indicators for each factor and then simply use FIML (so assuming MAR). The practical problem arises from the one-factor models maybe not fitting the data well. Imputation could use a less restrictive model. But then again, the scales you mention probably are sums of items which in itself assumes a one-factor model.

Thanks for your quick reply! The scales mentioned are actually subscales from questionnaires with more than one factor.

Just to clarify, are you suggesting running a full SEM instead of a path analysis model to assist with item-level missingness using FIML? I am concerned doing so would be difficult given my sample size.

I am a grad student working on a project with missing data in almost all variables. I have 5 latent variables, incl. 2 exogenous variables. I tried the syntax for missing data, but the error keeps asking me to add listwise=on and nochisquare in the output. I did that as well, but the error keeps coming back. Please help. I would not want the program to delete a whole case because of a few missing data points. Here is my syntax-

*** WARNING in ANALYSIS command Starting with Version 5, TYPE=MISSING is the default for all analyses. To obtain listwise deletion, use LISTWISE=ON in the DATA command. 1 WARNING(S) FOUND IN THE INPUT INSTRUCTIONS

If you want the model estimated using all available information, remove LISTWISE=ON; from the DATA command. The warning is just informing you that the default is using all available information. If you get an error when you remove LISTWISE=ON, send that output and your license number to support@statmodel.com.

Hello, I would like to conduct a model-based (H1) multiple imputation analysis but my model contains a latent variable interaction. I specified a Bayesian estimator but a message in the output file says that Bayesian estimation is not allowed with latent variable interaction and so the default estimator (ML) was used. However, I also see specifications for Bayesian estimation in the Summary of Analysis section. Will you please tell me whether the resulting imputed data were imputed using a Bayesian estimator or an ML estimator? Also, can I be sure that the imputation was done under my H1 model and not under H0? Thank you.

I am trying to conduct multiple imputation for specified variables prior to analyses for my MA Thesis. The problem I have encountered is that missing data that appear as 999 in my SPSS data file and in my .dat data file, which Mplus is reading, appear as asterisks in all of the imputed files. I checked, and each 999 in my original data sets appears as an asterisk in the imputed files. I conducted a cross-sectional version of the same multiple imputation and subsequent analyses last week, and did not encounter this problem. I copied and pasted the same input file for the current data sets. Could you help to guide me toward my error? My apologies for the basic question.

Likewise, I don't understand what you mean by "when you read the data." Do you mean when I run the input file for the analysis, or when I run the multiple imputation input file? In the latter, I specified MISSING = ALL (999); under VARIABLES.

I made sure that the order of variables in the dat file, VARIABLES list, and IMPUTE VARIABLES list are in the same order.

I just re-did the entire process, starting from my SPSS file. The same asterisks appear in my imputed files. I ran some simple multiple regression analyses to see what would happen. I specified IMPUTATION as TYPE and gave the correct list name to retrieve my imputed files. I got a full set of output, even though there are still asterisks in the imputed files and I did not specify MISSING in the regression input. I got a warning saying that the chi-square test could not be conducted, perhaps due to a large amount of missing data. This is the only warning I received despite having the asterisks in the files.

I'm wondering why Mplus doesn't use cases with missing data on predictors with FIML? From what I've read, it's not possible to use cases with missing data on Y under FIML but still possible (albeit more difficult) to use cases with missing values on X but observed values on Y. Any light you can shed on this would be helpful. Thanks so much.

Missing data theory applies to dependent variables. Missing data theory does not apply to observed exogenous variables because the model is estimated conditioned on these variables. You can mention the variances of the observed exogenous variables in the MODEL command. This causes them to be treated as dependent variables and distributional assumptions are made about them but they will be used in the analysis.

I'm hoping this is a simple question and I'd like to be sure about it before I proceed. I'd like to bring covariates into the model so that FIML is used to handle missing on the covariates. Can you do this for binary covariates with missing values?

You can see this in the frequencies for the missing data patterns. The total sample minus those with no missing would be the number of observations with some missing values. For each variable or pairs of variables, see the coverage values.
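
Outside of Mplus, the same quantities can be tallied by hand. The Python sketch below counts missing data patterns, derives the number of cases with some missing values as total minus complete, and computes pairwise coverage (the proportion of cases with both variables present). The -999 flag, the data rows, and the variable layout are assumptions for illustration.

```python
from collections import Counter

MISSING = -999.0  # assumed missing value flag

# Hypothetical data: rows are cases, columns are three variables
rows = [
    [1.0, 2.0, 3.0],
    [1.5, MISSING, 2.0],
    [MISSING, MISSING, 4.0],
    [2.0, 1.0, 5.0],
    [0.5, MISSING, 1.0],
]
n = len(rows)
k = len(rows[0])

# Missing data patterns: True = observed, False = missing
patterns = Counter(tuple(v != MISSING for v in row) for row in rows)

# Total sample minus cases with no missing = cases with some missing
n_complete = patterns[(True,) * k]
n_some_missing = n - n_complete

# Coverage: proportion of cases with both variable i and variable j observed
coverage = [
    [sum(r[i] != MISSING and r[j] != MISSING for r in rows) / n for j in range(k)]
    for i in range(k)
]

for pattern, count in patterns.items():
    print(pattern, count)
print("cases with some missing:", n_some_missing)
for line in coverage:
    print(line)
```

The diagonal of the coverage matrix gives the proportion observed for each variable, so one minus that value is the proportion of missing data for the variable.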

I'm doing multiple imputation with Mplus and would like to know how to compute the standard deviation of a point estimate (the mean) from the standard error provided by Mplus. Could you please give a reference? Thank you.

I’m currently trying to run two-level multilevel models for several binary outcomes using FIML estimation procedures with longitudinal complex sample data. The models are complex: the level-1 models typically have 10-20 binary IVs, and the level-2 models for the intercepts have a maximum of 16 continuous IVs. Many of the IVs are completely observed.

Do I need to bring all of the x variables into the model in order to have observations having missing data for the x variables included? In a 6/22/2006 posting you note that “If only two or three of your covariates have missing data, then FIML should be fine. You should study the missing data in your covariates. Perhaps there are some with very little missing data such that you could allow the listwise deletion on those and bring the others into the model.” However, on 11/12/2014 you say that “if you want to bring one covariate into the model, you must bring all covariates in to the model. You cannot bring in just a subset.”

A small subset of the IVs accounts for most of my missing data. Is there a way to use the 6/2006 strategy and use listwise deletion for x variables missing small amounts of data – and not include x variables which don’t have any missing data in the model? Multiple imputation isn’t feasible for a variety of reasons. It looks as though your thinking on this may have changed – but I figured it’s worth asking. Thank you in advance for your help!

The issue with not bringing all the covariates into the model is that you want the covariates to correlate freely (as covariates should). This may not happen unless you model it. Say that you have, e.g., two covariates, with missing data on X1 and none on X2, and you bring X1 into the model (essentially making it a Y). This model may leave X1 and X2 specified as uncorrelated. If you say X1 WITH X2 then you bring X2 into the model, so you have to say X1 ON X2 to correlate them, and saying ON can have consequences for the rest of the model. So it is safest to bring all the Xs into the model. I assume you have considerable missingness on that small subset of IVs so that listwise deletion is not an option.

I am running a complex model with many x-variables. One of those x-variables has missing values. If I bring this x-variable into the model by mentioning its variance, the model no longer fits. The problem is that this variable is now assumed to be uncorrelated with the other x-variables. So I added WITH statements, which brought all the other x-variables into the model, and I then needed to add more WITH statements. Now I get a warning that the number of observed variables exceeds the number of clusters in my model.

This puzzles me. If I look at the diagram view, the model seems the same (and when I do this with x-variables without missing values, the chi-square and df are also the same). Why is there this large increase in observed variables, and do you know a way to deal with this problem? Is there a way to let Mplus estimate the x-variables without increasing the number of observed variables, or can I ignore this warning?

You must bring all of the covariates into the model or none of them. You can do this by mentioning the variances. When you do this, they are treated as dependent variables in the model. The warning is to remind you that independence of observation with clustered data is at the cluster level. The impact of this on your results has not been well studied.

Thank you for your answer. I still have a question about bringing in all the covariates. Why does this increase the number of free parameters in the model while at the same time not affecting the number of degrees of freedom?
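One possible way to see the arithmetic behind this question is to count parameters in both the estimated (H0) model and the unrestricted (H1) model. The sketch below is a single-level illustration only, not a description of what Mplus does in a two-level analysis. It assumes the covariates are brought in with freely estimated means, variances, and covariances, and that H1 is the unrestricted mean and covariance model. The function names and the example counts (4 outcomes, 3 covariates, 20 H0 parameters for the outcome part) are hypothetical.

```python
def unrestricted_count(n_vars):
    """Means plus variances/covariances of n_vars freely related variables."""
    return n_vars + n_vars * (n_vars + 1) // 2


def counts(n_y, p, h0_y_params, x_in_model):
    """Return (H0 parameters, H1 parameters, df) for n_y outcomes and p covariates.

    h0_y_params is the number of free parameters the H0 model uses for the
    outcomes given the covariates (a hypothetical, user-supplied number).
    """
    # H1 for the outcomes given the covariates:
    # intercepts, all slopes, and the residual covariance matrix.
    h1 = n_y + n_y * p + n_y * (n_y + 1) // 2
    h0 = h0_y_params
    if x_in_model:
        # The covariate means, variances and covariances are added,
        # unrestricted, to BOTH models, so the difference is unchanged.
        x_part = unrestricted_count(p)
        h0 += x_part
        h1 += x_part
    return h0, h1, h1 - h0


print(counts(4, 3, 20, x_in_model=False))  # covariates treated as exogenous
print(counts(4, 3, 20, x_in_model=True))   # covariates brought into the model
```

Under these assumptions the free-parameter count rises by the same amount in H0 and H1, which is why the degrees of freedom, being the difference between the two, stay the same.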

Hi Drs. Muthen, I am unclear why Mplus is not deleting cases that are missing data for all dependent variables. Notes from my output are below. BWACHGAP is dependent; all other variables are independent. 140,274 is the N in my total sample, but why is this number not decreased, given that some cases do not have data for the sole dependent variable?

Further, Drs. Muthen, I saw the note above that by modeling the variances of exogenous variables, they are treated as dependent and distributional assumptions are made about them; it was implied that this is one way to retain cases that would otherwise be dropped due to having missing data on all dependent variables. (Please correct me if that is wrong.) However, I have the following questions: 1) Why would an analyst want exogenous variables to be treated as dependent in the model, and what consequences are there to this? 2) When I explored the result of modeling vs. not modeling the variances of my exogenous variables named above, I found that the fit indices changed drastically simply due to the explicit modeling of these variances.

Please see below. Is this drop in goodness of fit due to improper assumptions that may be made about the distributions of these variables? Is the assumption multivariate normality? Thank you.

Regarding your first question, perhaps you are bringing x's into the model by mentioning their variances. In this case you no longer have a univariate model for your BWACHGAP DV, but you have a multivariate model for BWACHGAP and all the x's. So even if you have missing on BWACHGAP, people who have non-missing data on at least one x variable are (correctly) kept in the analysis sample.

Regarding the change in fit, I cannot speculate except to say that you should make sure you let all the x's correlate freely.
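The point about which cases stay in the analysis sample can be sketched as follows. This is a simplified illustration, not Mplus code: the toy records, the variable names, and the two helper functions are hypothetical, and the "conditional" rule reflects the assumption that cases with missing values on exogenous x's, or with missing values on all dependent variables, are dropped.

```python
# Hypothetical toy records: None marks a missing value; "y" is the single DV.
cases = [
    {"y": 2.1, "x1": 1.0, "x2": 0.0},     # complete
    {"y": None, "x1": 0.5, "x2": 1.0},    # missing on the DV only
    {"y": 1.7, "x1": None, "x2": 1.0},    # missing on one covariate
    {"y": None, "x1": None, "x2": None},  # missing on everything
]
y_names = ["y"]
x_names = ["x1", "x2"]


def kept_conditional(case, y_names, x_names):
    """x's treated as exogenous: complete x's and at least one observed y."""
    x_complete = all(case[x] is not None for x in x_names)
    any_y = any(case[y] is not None for y in y_names)
    return x_complete and any_y


def kept_joint(case, y_names, x_names):
    """x's brought into the model: any observed analysis variable keeps the case."""
    return any(case[v] is not None for v in y_names + x_names)


n_conditional = sum(kept_conditional(c, y_names, x_names) for c in cases)
n_joint = sum(kept_joint(c, y_names, x_names) for c in cases)

print("kept when x's are exogenous:", n_conditional)
print("kept when x's are in the model:", n_joint)
```

In this toy example the case that is missing on the DV but observed on the covariates is dropped under the first rule and retained under the second, which matches the behavior described in the answer.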

UG Ch11 states: "NMAR modeling is possible using ML estimation where categorical outcomes are indicators of missingness and where missingness can be predicted by continuous and categorical latent variables."

Yes, I have predicted missingness as a dichotomous outcome in such models--2 DVs are modeled: the outcome itself, and missingness. Both can be regressed on covariates, and this assumes MAR.

1) By correlating these two DVs, we can see if missingness is correlated with the predicted score in the whole sample--whether NMAR is a better assumption--right? Or is this not true, if ML estimation of missing scores (first DV) assumes MAR in the first place?

2) The above strategy (correlating these 2 DVs) works in a 1-level model but not a 2-level model. With the latter I get: "Covariances involving between-only categorical variables are not currently defined on the BETWEEN level."

I can run regressions of both DVs on the between level--but not correlate these outcomes. Does Mplus not allow for modeling covariances of dichotomous DVs on the between level?

I know mixture modeling is another option for NMAR, but given the first statement above, it seems this strategy should work: "categorical outcomes are indicators of missingness and missingness can be predicted..."

Perhaps simply not in a 2-level model where missingness is between only?

Drs. Muthen, I employed the strategy of creating a latent variable on level 2 to define the residual variance for the indicator of missingness, and successfully correlated this residual with the residual of the central variable.

It seems that MAR is a plausible assumption given an estimated value of the correlation of missingness with the DV of about 0. F1 WITH BWACHGAP: estimate -0.003, S.E. 0.424, Est./S.E. -0.006, two-tailed p-value 0.995.

1) The missing data literature emphasizes that you cannot test whether NMAR is more suitable than MAR. I recommend the book by Craig Enders. This means that your 2-DV approach is not correct, perhaps because the information on the residual correlation that you focus on comes only from those who don't have missing on Y (the rest is handled by the bivariate normal information). For NMAR modeling you need at least 2 DVs, not counting the binary missing data indicators. For more on NMAR modeling, see for instance: