I'm working on a growth mixture model with a couple of covariates. The models seem to be running fine, but when I output the Class Probabilities I get probabilities ranging from 0 to 4 or so. My understanding is that they should range from 0 to 1? So the question is: am I misinterpreting the output, or is something wrong with my model? Thanks.

I am assuming that you are talking about the output and not about saving conditional probabilities in a separate file. When you have covariates, only logit values are printed, not probabilities. When you have no covariates, the results are printed as both probabilities and logits. Let me know if this is not what you mean.
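
For readers who want to turn the printed logits into probabilities by hand, here is a minimal Python sketch. It assumes a multinomial logit parameterization in which the last class is the reference class with logit 0; the function names and all numeric values are hypothetical and chosen only for illustration. With covariates the logits depend on x, so there is one set of class probabilities per covariate pattern rather than a single set.

```python
import math

def class_probabilities(logits):
    """Convert K-1 class logits into K probabilities.

    Assumption: the last class is the reference class with logit 0.
    """
    full = list(logits) + [0.0]          # append the reference class
    peak = max(full)                     # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in full]
    total = sum(exps)
    return [e / total for e in exps]

def class_probabilities_given_x(intercepts, slopes, x):
    """Class probabilities for one covariate vector x (c ON x regression)."""
    logits = [a + sum(b * xi for b, xi in zip(bs, x))
              for a, bs in zip(intercepts, slopes)]
    return class_probabilities(logits)

# Hypothetical logits for a 4-class model without covariates
probs = class_probabilities([3.05, 0.18, 1.32])

# Hypothetical intercepts and slopes for a 3-class model with two covariates
probs_x = class_probabilities_given_x(
    intercepts=[1.2, -0.4],
    slopes=[[0.5, -0.3], [0.1, 0.8]],
    x=[1.0, 2.0],
)
```

A logit of 4 is therefore not a probability of 4; it only says that the class is much more likely than the reference class.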

I started with an analysis identifying the proper number of latent classes for a particular set of data and ended up with 4 latent classes with the following probability distribution:

#1 78.06% #2 4.41% #3 13.81% #4 3.70%

By itself, this is not problematic until I modify the model to include covariates as predictors of latent class membership (e.g., C#1 on verb3 impuls, etc.). After running such a model with the 4 class solution my probability distribution looks rather different.

#1 68.06% #2 12.95% #3 14.39% #4 4.58%

I don't recall seeing an explicit discussion of this in the manual, but I would like to understand why this happens and which class distribution I should rely on. At first glance this reminds me of circumstances in conventional growth curve modeling where the parameter estimates vary between unconditional (i.e., w/o predictor variables) and conditional (i.e., w/ predictor variables) models.

Your analogy with growth modeling is good. For a stable solution, the two class probability distributions should be the same, but they may not be for somewhat misspecified models. For example, some covariates may have direct effects on some outcomes. Note also that the order of the classes may be different for the two solutions.

Another question. Is it possible to work with (i.e., specify, constrain, etc.) the class probabilities in the Model statement? (I may just be missing something in the manual.) For example, I'd like to constrain two latent classes to be of equal size in a k>2 latent class analysis.

I'm finding that the above isn't behaving the way I expected it to. I'm running a 12-class LCA, in which I'm using training data to constrain members of one manifest group to one set of six classes, and members of another manifest group to the other set of six classes. There are no predictors of the latent class membership involved.

The two sets of six classes are constrained to be identical -- the thresholds for each indicator in Class 1 are constrained to be equal to the thresholds in Class 7, and so forth. With Bengt's help, I was able to get that part going. The next step I want to take is test the hypothesis that the class sizes are identical across sets -- that Class 1 is the same size as Class 7, and so on.

I've used the syntax above to constrain the class size parameters to be equal across groups (1 to 7, 2 to 8, etc.). I've tried leaving Class 6 unconstrained and constraining it to zero (I want it to match Class 12 in size). The model converges, and the "means" in the "latent class regression model part" do follow the equality constraint. However, the "proportions of total sample size" do not, and the discrepancies do not follow any obvious pattern.

is based on the estimated posterior probabilities for each individual, given the model and the individual's data. Posterior probabilities should be thought of as akin to factor scores. In most latent class models, the proportions reported here will agree perfectly with the class probabilities as obtained from the estimated [c#...] logit values, but not always - your model is an example of an exception. So I would say that you succeeded in getting the [c#...] parameters set up the way you wanted. The fact that the posterior probability results disagree may be an indication of model misfit, and how they disagree can be a suggestion for how to modify the model.

I'm back to the problem with some other data, and I'm still wrestling with this, I'm afraid. I'm testing a treatment vs. control difference. I've run one model with 8 classes. By training data, the control Ss are constrained to be in odd-numbered classes and the treatment Ss in even-numbered classes. The two groups are the same size. In the first model, the indicator logits for the classes are constrained to be equal across pairs of classes (class 1 with 2, class 3 with 4, etc.); there are no constraints on the latent class probabilities. This seems to be working fine, and fits only slightly worse than a model without the constraints.

Then I want to test the hypothesis that, assuming the same class structure, the class sizes/proportions are the same in the two groups. Eyeballing the FINAL CLASS COUNTS in the first model, it looks like there's a pattern of differences. So I ran a second model, with the following code added to the %OVERALL% portion:

Actually your results are matching up. The final class count is the sum of the estimated posterior probabilities for each individual. Much like the average factor score estimate could be different from the estimated factor mean those numbers are different. Of course you may question the restrictions imposed in the latent class regression model part.
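
To make the distinction concrete, here is a small Python sketch with made-up numbers. It assumes a 2-class model whose class logit has been constrained to 0 (so the model-implied class probabilities are 0.5 and 0.5) and four hypothetical individuals with their posterior probabilities. The final class counts are the column sums of the posterior probabilities, and they need not reproduce the constrained model probabilities.

```python
import math

def model_class_probabilities(logits):
    """Model-implied class probabilities (assumes the last class has logit 0)."""
    exps = [math.exp(v) for v in list(logits) + [0.0]]
    total = sum(exps)
    return [e / total for e in exps]

def final_class_counts(posteriors):
    """Sum each class's posterior probability over all individuals."""
    n_classes = len(posteriors[0])
    return [sum(row[k] for row in posteriors) for k in range(n_classes)]

# Hypothetical posterior probabilities, one row per individual
posteriors = [
    [0.90, 0.10],
    [0.60, 0.40],
    [0.20, 0.80],
    [0.55, 0.45],
]

model_probs = model_class_probabilities([0.0])    # equality constraint
counts = final_class_counts(posteriors)           # sums of posteriors
proportions = [c / len(posteriors) for c in counts]
```

Here the constrained model probabilities are equal, while the posterior-based proportions are 0.5625 and 0.4375, in the same way that an average factor score can differ from the estimated factor mean.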

Ok, thanks. Somehow your wording helps Bengt's message from 10/18 click; now I see what's going on. So does Mplus anywhere produce what I was looking for, namely the raw probabilities of class membership? Or is the best way to exponentiate the logits and work with the odds?

Dear Linda, I am a new user of Mplus. I am using your short courses package as well as the User's Guide. I have started the course on modeling with categorical latent variables, and I would very much appreciate it if you would expand a little on the statement on page 13, "The u-c relation is a logit regression". With thanks

Look up computations with logistic regression in the Version 3 Mplus User's Guide, chapter 13. This explains the logit in (81) on the page you are referring to. U is the binary dependent variable and c is a categorical x variable. U has a threshold which is the negative of an intercept parameter. If c has 2 classes, then that means that we have a dummy x variable which in line with linear regression means that there are 2 intercepts in the model.
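
As a rough illustration of this logit relation, the Python sketch below converts a class-specific threshold into the probability that u = 1. It assumes the relation stated above (threshold = negative intercept), so that P(u = 1 | c = k) = 1 / (1 + exp(threshold_k)). The function name and the threshold values are hypothetical.

```python
import math

def prob_u_equals_1(threshold):
    """P(u = 1 | class) for a binary indicator with a logit threshold.

    Assumption: the logit of u = 1 is the intercept, and the threshold
    is the negative of that intercept.
    """
    return 1.0 / (1.0 + math.exp(threshold))

# Hypothetical thresholds for a 2-class model: one threshold per class,
# which plays the role of the two intercepts of a dummy-variable regression.
thresholds = {"class 1": -1.0, "class 2": 2.0}
item_probs = {name: prob_u_equals_1(t) for name, t in thresholds.items()}
```

A negative threshold gives a probability above 0.5 and a positive threshold gives a probability below 0.5.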

Is class 1 the base class? That is, for the coefficient of y20, which is 0.515, does the log odds = 1.674 imply that for a change of 1 unit in variable y20, the odds that the individual belongs to the base class (class 1) increase by a factor of 1.674?
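
A quick arithmetic check on the two numbers quoted in this question, as a Python sketch: 1.674 is exp(0.515), so 0.515 is the coefficient on the log-odds scale and 1.674 is the corresponding odds ratio. Which class the odds refer to depends on which class the program treats as the reference class; that is not settled by this calculation.

```python
import math

coefficient = 0.515                  # logit coefficient of y20 quoted in the question
odds_ratio = math.exp(coefficient)   # multiplicative change in the odds per unit of y20

print(round(odds_ratio, 3))          # prints 1.674
```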

Problem 1: I have a dichotomous X variable (that underlies a continuous distribution) that I want to use as a predictor in a growth mixture model. I can't find an example in the manual on defining this X as categorical, is it unnecessary to do so (if it underlies a continuous distribution)?

Problem 2: On a different project. I have a somewhat unexpected effect of a cognitive measure called X on the LGM slope of achievement (y1 through y4). The model has 4 time points, and the time scores for the last two are freely estimated. The estimated time scores are 1.5 and 1.8. I am having trouble interpreting the effect of X, which I would have expected to be positively related to the slope (it is positively related to the intercept). This may be a rather interesting finding: the higher a respondent scores on cognitive skill X, the higher their intercept will be, but the higher they score on X, the lower their growth rate.

I plotted the slope estimates by the X variable (by the way, I LOVE that we can do this in Mplus!!) and everything looks fine (yet negative). It seems to me that it's not their growth rate 'overall', it's their initial growth rate (0, 1), because the estimated time scores are 1.5 and 1.8, suggesting a smaller rate of growth at times 3 and 4. That is, in the first period of growth (0, 1) there is a greater incline for those with a lower score on cognitive skill X, but as time goes on (1.5 and 1.8) that initial effect is smaller. Is it correct that the effect of X (when the y's are coded 0, 1, *2, *3 rather than 0, 1, 2, 3) is on the initial growth rate (0, 1)? We assume it's linear and stable (as it would be if modeled 0, 1, 2, 3), but when you estimate the time scores you can't conclude that. Or am I unnecessarily getting thrown off by the estimated time scores?

1. The scale of observed x variables is not an issue in estimation. They should not be placed on the CATEGORICAL list. In regression, covariates are treated as either dummy variables or continuous variables.

2. It is not unusual for a covariate to have a positive influence on the intercept growth factor and a negative influence on a slope growth factor. You may want to use time scores 0 * * 1 if you are interested in growth between the first and last time points.
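
To see what the choice of time scores does to the meaning of the slope factor, here is a small Python sketch. It uses the estimated time scores 1.5 and 1.8 from the question, but the intercept and slope means are hypothetical. With time scores 0, 1, 1.5, 1.8 the slope is the change between the first two occasions; dividing all time scores by the last one gives the 0 * * 1 scaling, where the slope is the total change from the first to the last occasion. The implied means are the same under both scalings.

```python
def implied_means(intercept, slope, time_scores):
    """Model-implied outcome means: intercept + slope * time score."""
    return [intercept + slope * t for t in time_scores]

time_scores = [0.0, 1.0, 1.5, 1.8]   # last two values are the estimates from the question
intercept, slope = 10.0, 2.0         # hypothetical growth factor means

# Rescale so the last time score is 1 (the "0 * * 1" parameterization)
last = time_scores[-1]
rescaled_scores = [t / last for t in time_scores]
rescaled_slope = slope * last        # slope now equals total change, first to last

original = implied_means(intercept, slope, time_scores)
rescaled = implied_means(intercept, rescaled_slope, rescaled_scores)
```

Under the same rescaling, a covariate's effect on the slope factor is multiplied by the same constant, so its sign does not change; only the time span it refers to changes.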

I am relatively new to mixture modeling and am not sure that I understand the rationale behind constraining class probabilities to be equal. There does not seem to be a clear explanation of the circumstances underlying when this should or should not be done in the manual or on the discussion board. I have a few questions on this issue:

1) Is this primarily a theoretical consideration (i.e., I have no a priori beliefs as to class sizes so I will assume them to be relatively close to equal)? Or is it instead based upon clues derived from an analysis of results from runs in which class probabilities were not constrained?

2) By using this feature, am I somehow forcing data to fit in groups or am I simply making it easier for the model to converge, similar to assigning starting values for class thresholds?

3) What is the relationship (if any) between the logit starting values I assign for class thresholds and constraining class probabilities to be equal?

Thanks in advance - as a novice, I know I'm probably blurring lines on several of these issues but appreciate your thoughtful response.

BTW - I am working with all continuous variables if that makes any difference.

I have run LCA on 3 sets of data (N = 197, N = 203, and N = 400) for psychiatric comorbidities on a sample of patients (5 comorbidities). I obtain 3 classes as the best solution in all three cases, and they make sense clinically. I was interested in the probabilities of belonging to each class for subjects and used the command:

I am working on GMM and I try to decide on the best number of classes. When I choose the model with the best BIC, the class probabilities vary between .65 and .85.

My questions are: Does a class probability of .65 still indicate good fit? Is there a minimum required value for the average latent class probabilities? Is there a rule of thumb with regard to the average latent class probabilities to decide on model fit and the number of classes?

Or should I just trust the best BIC and not base my conclusion on the class probabilities?

I have been comparing 2, 3 and 4 latent class models to the same with 1 factor. The best fit appears to be the 3 class 1 factor model. In the final class counts based on the estimated model and estimated posterior probabilities, class 2 has 3.7% of the sample. However in the final classification class 2 has no individuals! Distribution based on most likely class membership is:

C#1 0.272 C#2 0.000 C#3 0.728

I notice from the discussion above that it's possible to constrain 2 classes to be the same size where there's 3 or more classes. I wonder if that is feasible for my analysis? Or is there something else going wrong with my analysis? Fiona

I am running a 3-class multigroup LCGA with the following class probabilities.

group 1: 0.28 0.63 0.09 group 2: 0.25 0.68 0.07

A chi-square crosstabulation (by hand) indicates that the probabilities differ between the groups. Is there a way in Mplus to test directly whether the 0.63 differs from the 0.68? And is there a way to get SE's or CI's for these probabilities?

See Slides 48-50 of the Topic 6 course handout. This shows how MODEL CONSTRAINT can be used to define the latent transition probabilities described toward the end of Chapter 13. Standard errors are estimated for new parameters defined in MODEL CONSTRAINT.

I am currently conducting LCA with 5 binary variables and 3 categorical variables (with 3 categories each). When looking at the results, quite a few item response probabilities are set to 0.0/1.0, because the "logit thresholds were approached and set at extreme values". Is there a possibility to turn this default setting off? If not, is this also normally done in LatentGold? (I ask this for comparative reasons; I am comparing my results with those from a previous study.) Also, I was wondering whether you can tell me when one should use a logit and when a probit link when conducting LCA. The v5 manual only mentions the option with regard to LTA. Which one should you use when you employ an LCA? Finally, in the recent book on LCA and LTA by Collins and Lanza (Wiley series) it is explicitly stated that the likelihood ratio statistic (G^2) shouldn't be used when the df of the model is high because the reference distribution for the G^2 statistic is not known. Is it still okay to use the statistics reported in TECH11, as they are based on the likelihood ratio statistic?

Yes, the likelihood-ratio chi-2 (G^2) does not approximate chi-2 well with the sparseness in the observed-data frequency table that you get with many variables. And you cannot use chi-2 difference testing to find the appropriate number of latent classes; that also is not chi-2 distributed. Instead, BIC has been found to work well, as have TECH11 and TECH14. See for example the article

which is on our web site under Papers, Latent Class Analysis. TECH11 and TECH14 are akin to the bootstrap approaches mentioned in the book you referred to, although not focusing on the frequency table chi-2, but directly on the likelihood-ratio statistic, considering its non-chi-2 distribution.

Hello, I am trying to determine how best to characterize derived classes. I noticed that in the "results in probability scale" section of Mplus output, there are significance values associated with the probabilities of each latent class indicator. Are these p-values indicating whether the probability is greater than 0 or testing another hypothesis? In addition, in one case, the probability is significant even though the probability value is 0. In other cases, when the probability value is 1.00, the p-value is also 1.00. thanks in advance! Marcy

I presume this is a very basic question about Latent Class Analysis. I am running a two group LCA (trying to discriminate MZ and DZ twins from a bunch of ratings of their similarity). I have good reason to think that the solution should be unbalanced, with about 90% of the cases in one group and 10% in the other. But no matter what I do I seem to end up with close to 50-50. Is there a way to force Mplus to find an unbalanced solution? Is it a matter of fixing the mean of the latent categorical variable on a logit scale? Can one do it with starting values? Or by fixing the loadings on a few select items? Thanks.

Dear Bengt/Linda, I am working on a GMM/LCGA type model which includes about 10.000 cases. I first identified the correct number of classes, and now wish to regress c on some covariates. However, I'd also like to treat the most likely classes from the first stage as observed variables in the second stage, just to check my model solutions. I found that the maximum record length is set at 5.000 in Mplus 6, so I have some problems saving the class probabilities. Can you recommend any work-around procedure? Thanks heaps,

Sorry, I should be clearer. I have a fairly large dataset (10.000) and wish to save class probabilities. There seems to be a record length maximum of 5.000, however. Is there a way to save data for more records than 5.000?

Hello, I am working on a LCA with longitudinal binary data and estimate thresholds for each class. However, when I add an intercept to the overall statement of the model (which is actually a growth parameter), the model fit is considerably improved.

How can this be, and does this affect the interpretation of the thresholds? e.g. do I need to standardise them now as they will be affected by a latent continuous variable (intercept)?

It sounds like you are freeing the intercept of the intercept growth factor. This should be fixed at zero as part of the growth model parameterization. It should not be freed. If this is not what you are saying, please send your output and license number to support@statmodel.com.

Dear Dr. Muthen, I am using a mixture modeling approach to crime rates at the county level. The best fitting model was the one with four latent classes. When I ran the frequencies on the class variable saved in the data file and compared those with the class counts reported in the output, I noticed some differences:

Hi, I am estimating LPA models from plausible values (20 sets) from a previous ESEM model (i.e. I want to estimate the profiles based on the "factor scores", but tried PVs for greater precision). Doing so, I get warnings that I can't save class memberships and, most importantly, class probabilities from the models. I guess that is because we work from 20 different data sets. I am wondering whether there would be a way to save this information (i.e. to combine/merge the class probability results into a single data file)? Thanks

I would very much like help in locating the formula for calculating CPROBABILITIES for the individuals in a mixed model.

What I mean is each individual's probability for belonging to class 1, class 2 etc. when knowing the person's item scores and the Mplus estimated model parameters for a certain mixed model.

Mplus gives these probabilities in a save file when demanding CPROBABILITIES, but I would like to be able to calculate them myself based on the parameters estimated by a Mplus model (like thresholds and alpha).

I have looked into the Technical Appendix 8, but it seems that the formula/algorithm to use is not there.

u* comes from eqns (150) and (151), so it is also a function of model parameters - and covariate values x_i. You should view u* as the logit behind each observed u, so the parameters of those two eqns are found in whatever relationships in your model that influence the u's.

I have an additional question concerning calculation of posterior probabilities:

I tried the method of calculating posterior probabilities which is described in topic 5 slides 69 and 70 (Berlin july 2009 version). I have a simple model with one latent class variable (with two classes), and with 10 binary indicators.

I estimated the model in Mplus. Then I plugged the probabilities from this model into the equations in slides 69 and 70 for some selected persons. The result was that I got the same posterior probabilities that were calculated by Mplus.

My question is: In slide 70 it is indicated that I have to use the EM algorithm in order to get the posterior probability for a person, treating the class membership as missing data. Why was this in fact not necessary when using the parameters from the estimated Mplus model?

The probabilities you plugged in (using slide 71) were from the Mplus output and Mplus computes those from the estimated model parameters - and those are obtained via ML estimation using the EM algorithm.
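
In other words, once the parameter estimates are in hand, the posterior probabilities follow from Bayes' rule without any further iteration. Here is a minimal Python sketch for binary indicators under the usual conditional independence assumption. The function name is made up, and the class and item probabilities are hypothetical (three items instead of ten, to keep it short); in practice they would be taken from the estimated model.

```python
def lca_posteriors(class_probs, item_probs, responses):
    """Posterior class probabilities for one person's binary responses.

    class_probs: P(c = k) for each class
    item_probs:  item_probs[k][j] = P(u_j = 1 | c = k)
    responses:   observed 0/1 responses u_1..u_J
    """
    joint = []
    for pi_k, p_k in zip(class_probs, item_probs):
        likelihood = pi_k
        for p, u in zip(p_k, responses):
            likelihood *= p if u == 1 else (1.0 - p)   # conditional independence
        joint.append(likelihood)
    total = sum(joint)                                  # marginal P(u)
    return [j / total for j in joint]

# Hypothetical 2-class model with 3 binary indicators
class_probs = [0.6, 0.4]
item_probs = [[0.9, 0.8, 0.7],
              [0.2, 0.3, 0.1]]

posterior = lca_posteriors(class_probs, item_probs, responses=[1, 1, 0])
```

The EM algorithm is what produces the estimates of class_probs and item_probs in the first place; the formula above is the E-step evaluated once at the final estimates.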

I've run an LPA with one sample and I have determined the number of classes to retain. Using the classes that were determined in the LPA, I would like to determine the most probable classes for a new sample of individuals. For example, I would like to give the same items that were used as class indicators to a new sample and use the results of the LPA to predict which class each person would most likely be in. Ideally, for each person in the new sample, I would like to be able to use the item responses and compute the probability of being in each class, without having to run a new LPA. Is this possible? If so, does the Mplus output provide the necessary information to compute the probabilities? What equation should I use?

Yes, this can be done and is a good use of LPA. For the new sample of individuals you simply fix all the model parameters at the solution you got from your first sample. This can easily be done using the SVALUES option in the OUTPUT command. The second run then does not estimate any parameters but only estimates the posterior probabilities for each class for each subject (see the CPROBS option of the SAVEDATA command). The results also include an indication of the most likely class. In fact, you can do this for only a single subject.

Thanks for the response. Is there a way to estimate the probabilities without running MPLUS? For example, we would like to give each individual in the new sample the set of items, then, in an on demand environment provide feedback to the individual about the class they would most likely fall into based on their pattern of responses on the items. We would like to be able to set this up in an online or computer environment using a standard programming language such as visual basic or javascript.

I tracked down and deconstructed the posterior probability equation that LPA uses to estimate the probabilities. Since I know the latent class means, latent class covariance matrices, and the response vector, I could use the equation to produce new probabilities for each new individual. The formula relies on complex matrix algebra to estimate the density functions, which isn’t a show stopper, but we were hoping to find something that might be simpler.

My Questions: Have you ever seen anyone use LPA/LCA in this type of application? Do you know of any other way to compute the probabilities that might be simpler than relying on the posterior probability formula? We were hoping that LPA would be similar to discriminant analysis and produce a set of equations that predict class membership for a single person. Finally, I wanted to check which density function MPLUS uses to compute the posterior probabilities. Does it use the multivariate normal density function?

I think this is a not uncommon type of application and I know others who have had similar interests in having their own routine. I don't think one can avoid going via the computations of the posterior probabilities (like we show in Appendix 8 in the V2 Tech App.). Discriminant analysis assumes known groups, whereas with the posterior probabilities of mixtures a subject is a fractional member of several groups. Yes, Mplus uses the multivariate normal density when the outcomes are continuous.
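
For anyone who wants to set up such a routine, here is a minimal Python sketch of the posterior probability computation for continuous indicators. It makes a simplifying assumption that is not stated in this thread: the indicators are uncorrelated within class (diagonal covariance matrices), so the multivariate normal density reduces to a product of univariate normal densities and no matrix algebra is needed. With non-zero within-class covariances the full multivariate normal density would be required instead. The function names and all parameter values are hypothetical.

```python
import math

def log_normal_density(y, mean, variance):
    """Log of the univariate normal density."""
    return -0.5 * (math.log(2.0 * math.pi * variance) + (y - mean) ** 2 / variance)

def lpa_posteriors(class_probs, class_means, class_variances, y):
    """Posterior class probabilities for one response vector y.

    Assumption: indicators are independent within class (diagonal covariance).
    """
    log_joint = []
    for pi_k, means_k, vars_k in zip(class_probs, class_means, class_variances):
        value = math.log(pi_k)
        for yj, m, v in zip(y, means_k, vars_k):
            value += log_normal_density(yj, m, v)
        log_joint.append(value)
    peak = max(log_joint)                       # log-sum-exp for numerical stability
    weights = [math.exp(v - peak) for v in log_joint]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical 2-class, 2-indicator solution
class_probs = [0.5, 0.5]
class_means = [[0.0, 0.0], [2.0, 2.0]]
class_variances = [[1.0, 1.0], [1.0, 1.0]]

posterior = lpa_posteriors(class_probs, class_means, class_variances, y=[0.3, -0.1])
```

The same few lines translate directly to JavaScript or Visual Basic, since only logarithms, exponentials, and sums are involved.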

I have a question regarding interpretation of conditional probabilities for particular classes. If the conditional probability for a particular response category in a particular latent class is say 0.75 can one say that 75% of respondents in this class selected this category or is it only appropriate to say that respondents classified here had a 75% chance of selecting this category. I am of the opinion that in this case these two are interchangeable since the probabilities also represent proportions. Is this the case?

"respondents classified here had a 75% chance of selecting this category". This is the probability that is being estimated. Even better is:

"respondents in this class had an estimated probability of 0.75 of selecting this category."

- That avoids the ambiguous phrase "classified here". Note that we are talking about a model, that is, about subjects' class membership (a subject is a member of only one class in the model), not how they were classified after the parameter estimation was done (when they are fractionally members of several classes).

The other statement sounds like you are talking about the subjects who are most likely in this class and among them 75% are in this response category - that may not be true.

I am running a LCA with 3 continuous level variables and 2 dummy variables (reference dummy omitted: D_Me). I am not clear how to interpret the conditional probabilities for responses on the dummy variables.

1. Why does MPLUS assign 1 and 0 to some of the conditional probabilities - how can I interpret this?

I am now treating the 3-category variable as categorical (it is ordered).

To be clear:

1. The model terminates normally, but I am getting this message: IN THE OPTIMIZATION, ONE OR MORE LOGIT THRESHOLDS APPROACHED AND WERE SET AT THE EXTREME VALUES. EXTREME VALUES ARE -15.000 AND 15.000. THE FOLLOWING THRESHOLDS WERE SET AT THESE VALUES: * THRESHOLD 1 OF CLASS INDICATOR CE_CAT FOR CLASS 3 AT ITERATION 110 * THRESHOLD 2 OF CLASS INDICATOR CE_CAT FOR CLASS 3 AT ITERATION 110

Should I be concerned?

2. The "Results in probability scale" can indeed be interpreted as the probability each class falls into a specific category of the ordinal level variable?
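
To see why thresholds at -15 and 15 show up as probabilities of 0 and 1, here is a small Python sketch for an ordered indicator with a logit link. It assumes the cumulative-logit form P(u <= j) = 1 / (1 + exp(-threshold_j)), so that category probabilities are differences of adjacent cumulative probabilities. The function names are made up for illustration.

```python
import math

def logistic(z):
    """Standard logistic cumulative distribution function."""
    return 1.0 / (1.0 + math.exp(-z))

def ordinal_category_probs(thresholds):
    """Category probabilities implied by ordered logit thresholds.

    Assumption: P(u <= j) = logistic(threshold_j); the last category
    takes whatever probability remains.
    """
    cumulative = [logistic(t) for t in thresholds] + [1.0]
    probs = []
    previous = 0.0
    for c in cumulative:
        probs.append(c - previous)
        previous = c
    return probs

# Both thresholds at the upper extreme: essentially everyone in the lowest category
at_upper = ordinal_category_probs([15.0, 15.0])

# Both thresholds at the lower extreme: essentially everyone in the highest category
at_lower = ordinal_category_probs([-15.0, -15.0])
```

A threshold of 15 corresponds to a cumulative probability within about 0.0000003 of 1, which prints as 1.000 in the probability-scale results.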

We would like to calculate the percentage of correct class assignments in a simulation study of latent profile analysis (no covariates). Is it possible to save the individual posterior probabilities for each class and the most likely class membership (save=cprob) in a Monte Carlo simulation with multiple replications to compare to the Monte Carlo data sets that provide true class memberships? Or, is there another way to obtain this value?

While performing a 4 class latent profile analysis, the SAVEDATA = CPROB; option only results in a datafile in which the values of the variables + id variable are included. So, no posterior and class probabilities and also no assigned class number. Could you tell me what I'm doing wrong?

I'm using LCA and LTA and my questions here concern which data to present in tables and graphically. Should I output the results into a data file, requesting cprobabilities, and then compute the means of each item for each class using each person's most likely class membership? Or, should I graph the class probabilities of each item in the "results in probability scale" on the output file?

The means obtained from averaging most likely class are different from the results (in the probability scale) on the output, as they are different things, but I'm not sure which is better to present.

Can you also clarify the difference in the most likely class membership and the probabilities?

I would use the Mplus Plot command to plot the means/probabilities for the items in each class.

I would not use most likely class unless necessary and it isn't here.

Each person gets an estimated probability (cprob) to be in each class. So with 2 classes the person may get the cprob values 0.85 and 0.15. The most likely class membership for that person is then class 1. But of course that is a cruder piece of information than using both 0.85 and 0.15. A second person with cprobs 0.90, 0.10 is a bit more clearly a class 1 member.
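
The same point as a short Python sketch. The first two rows are the cprob values mentioned above; the third row is a hypothetical extra person added to show how counts based on most likely class can differ from counts based on the posterior probabilities themselves.

```python
def most_likely_class(cprobs):
    """Return the 1-based number of the class with the largest posterior probability."""
    return max(range(len(cprobs)), key=lambda k: cprobs[k]) + 1

people = [
    [0.85, 0.15],
    [0.90, 0.10],
    [0.45, 0.55],   # hypothetical third person
]

modal = [most_likely_class(p) for p in people]

# Counts based on most likely class membership (whole persons)
modal_counts = [modal.count(k + 1) for k in range(2)]

# Counts based on the posterior probabilities (fractional membership)
posterior_counts = [sum(p[k] for p in people) for k in range(2)]
```

The modal counts are 2 and 1, whereas the posterior-based counts are 2.2 and 0.8, because the first kind of count discards how certain each assignment is.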

Hi Dr Muthen, I just ran a 3-class factor mixture model using the Bayesian estimator, which took less time to reach convergence than using algorithm integration. The output looks interesting; however, I couldn't figure out the response pattern in each class, as no item-response probabilities were given. I wonder if I need to use an auxiliary variable, export the class probabilities into a different file, and analyze class characteristics that way. Is there a way to obtain item-response probabilities in the output?

Thanks Dr Muthen. I've read your work on Bayesian analysis and have come across "label switching"; it seems that there isn't a solution to this yet, is there? Another interesting observation, which I read somewhere, is that model modifications do not really have much impact on the overall fit of the model assessed using PPC. Why is that, and how do you then work out potential areas of misfit? Also, can you request DIC in mixture models? If not, are there alternative indices that allow for model comparison?

We discuss the label switching in our Topic 9 Bayes course on the website. You can apply constraints. Because of this and the current lack of measures for model comparisons, I would not recommend Bayes for mixture modeling unless you are an expert Bayes analyst.

I don't have the PPP experience you mention. That is, model improvements can be seen in the range given in the output below the PPP value.

I am currently on a project in which I need to calculate the probabilities of belonging to a set of latent classes, given individual responses to the latent class indicators. I can do this when the latent class indicators are categorical but I can't find an equation that will calculate these probabilities when the indicators are truly continuous. Could you point me to such an equation?

I saved class probabilities and then opened the saved probabilities in WordPad. How do I determine which column represents which class? Any suggestions to help interpret this issue would be much appreciated.