I would like to understand what the difference is between the Mplus estimation method for categorical items and the Rasch model for scale (or test) construction. In other words, are the several models (rating scale, partial credit and equal dispersion model) extraneous to the Mplus estimation method or not? Thanks for the answer.

With a single factor, Mplus estimates a 2-parameter normal ogive, to use IRT terms. A transformation to IRT parameters a and b is given in Muthen, Kao, Burstein (1991) - see the IRT reference list under the Mplus Discussion topic References. Guessing is not taken into account. This means that results are close to those of the 2-parameter logistic, using the usual conversion factor to make probit results close to logit results. Differences also arise because Mplus uses weighted least squares (WLS) and not maximum likelihood (ML). Mislevy had an article in Journal of Educational Statistics in 1986 showing that WLS and ML gave close results in multi-factor situations. A Rasch model can be estimated by holding loadings equal across items. Differences with Rasch programs would again be due to WLS versus ML; note also the custom of average difficulty = 0 in Rasch. I don't think the Partial Credit or Rating Scale models in line with Masters and Andrich can be handled in Mplus, but maybe other users would know. Mplus handles ordered polytomous outcomes, which is in line with Samejima's graded response model.

I'd like to continue the discussion of IRT-like analyses in MPLUS. Specifically, I have analyzed the same data set using Mplus, R. McDonald's NOHARM program, and the BILOG-MG IRT program. The results differ across programs. My question is whether I should worry about such differences, or if they're within the range of what one would expect, given differences in estimation procedures (e.g., ML vs. WLS).

The details are as follows: I'm analyzing 11 dichotomous items, with a sample size of 3091. In Mplus, I specified Model: F1 by I1* I2-I11; F1 @ 1; (thereby constraining the factor variance to be 1.0).

I won't overwhelm everyone with all the results, but here are the estimates for two items:

My understanding is that estimates from Mplus can be transformed into the parameterization used in BILOG by the following:

BILOG slope = loading / sqrt(1 - loading**2)
BILOG threshold = threshold / loading

For item #1, this calculation yields 1.542 for the slope/loading, while BILOG reports 1.674. For the threshold, the formula yields .344 while BILOG says .295.
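The conversion described above is simple enough to sketch in code. The loading and threshold values below are hypothetical, just to illustrate the arithmetic (the poster's actual item estimates are not shown):

```python
import math

def mplus_to_bilog(loading, threshold):
    """Convert a one-factor Mplus probit loading/threshold (y* variance
    standardized to 1, factor variance 1) to BILOG-style a and b."""
    a = loading / math.sqrt(1.0 - loading ** 2)  # slope/discrimination
    b = threshold / loading                       # difficulty
    return a, b

# Hypothetical estimates, not from the posted analysis:
a, b = mplus_to_bilog(loading=0.80, threshold=0.30)
print(round(a, 3), round(b, 3))
```

Note this is exact only for the probit metric; BILOG's logistic estimates will still differ somewhat, as discussed in the reply below the original post.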

Again, the issue for me is whether such differences are cause for concern, or if that's just the way it is. Thanks in advance to anyone who can provide illumination!

I think the magnitude of the differences between BILOG and Mplus is not a cause for concern. They are due to using ML instead of WLS and using logit instead of probit (the constant 1.7 in the logit gives only an approximate closeness to the normal). Note also that the item response curves may be close even when the thresholds and the slopes are a bit different (these are correlated quantities). As for thresholds, Mplus WLS and Mplus WLSMV differ a bit due to the difference in the weight matrix. With WLS the thresholds are not simply inverse normal transforms of the sample proportions, but are a function of the weight matrix covariance between thresholds and correlations and the fit of the correlations, see Muthen (1978) Psychometrika equation (20).
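The "approximate closeness" from the 1.7 constant can be checked numerically; this standalone sketch (not tied to any Mplus output) compares the scaled logistic curve with the normal ogive over a grid:

```python
import math
from statistics import NormalDist

phi = NormalDist().cdf  # standard normal CDF

def logistic17(x):
    """Logistic curve with the usual scaling constant 1.7."""
    return 1.0 / (1.0 + math.exp(-1.7 * x))

# Largest gap between the two curves on a grid over [-4, 4]:
gap = max(abs(logistic17(x / 100) - phi(x / 100)) for x in range(-400, 401))
print(gap)  # small, but not zero - the source of minor probit/logit differences
```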

Is there a standard method for quantifying the 'bias' in the estimate of mean differences in latent ability according to group membership when measurement non-invariance (DIF) is not taken into account?

Here's what I tried. I built a single-group MIMIC model and detected significant direct effects yada-yada-yada. Then I estimated a mis-specified model with previously detected direct effects fixed to 0.

Now comparing overall model fit isn't really interesting because I already know those direct effects significantly improve model fit. What I'm really interested in is how wrong would I be about inferences made on subject ability if I ignored the direct effects?

I chose to compare the Std parameter estimates for the indirect effects for each model, because the residual variances of ability are different across models and the background variable is binary.

The full model returns a Std indirect effect of 0.490, the mis-specified model 0.498. Thus I conclude that ignoring bias results in over-estimating group differences by a whopping 1.6%.

I think your approach makes sense. In addition, with several x's, you could look at the standardized mean differences for different groups, e.g. consider how differently, in a standardized metric, black females compare to white females with and without a direct effect. As a complement, I guess one could also ask what happens to the factor score for a given individual with and without a direct effect.

Based on the above discussion and Muthen and Christoffersson (1981), I was wondering why it might still make sense to perform an Mplus multigroup CFA for categorical variables if one doesn't render it as a 2P IRT model.

Unless the scale factors are constrained across groups, it doesn't seem that differences in group means and variances are necessarily meaningful, since the definition of the measures could still be quite different for each group in spite of having equal thresholds and loadings.

Differences across groups in factor means and variances are meaningful if thresholds and loadings have (partial) invariance across groups even if scale factors are different across groups. A useful analogy is with continuous indicators where in order to compare factor means and variances across groups you do not need invariance across groups of error variances in addition to invariant intercepts and loadings. With categorical outcomes, allowing scale factors to differ across groups in the presence of invariant thresholds and loadings can be thought of as allowing noninvariant error variances.

Within the Mplus CFA framework then is there no meaningful way of "correcting" for DIF by imposing a direct effect between a covariate and an indicator (as there would be if the CFA was rendered as an IRT) ?

I did not realize that you were asking about direct effect DIF modeling in our last message, but that does not change my answer. I conclude the opposite from our discussion: including a direct effect is a meaningful way to correct for DIF, just as it is with regular CFA with covariates. Please let me know how I can help us reach agreement on this issue.

My second question was a follow-up rather than a restatement of the first.

I am unclear as to how a direct effect between a covariate and an indicator adjusts for DIF for an Mplus CFA with categorical indicators. This may be due to confusion regarding the nature of scale factors in the Mplus CFA and the correspondence between the Mplus CFA and a 2P IRT.

The interpretation of adjusting for DIF in the case of a 2P IRT is straightforward: the DE adjusts for differences in item difficulties _within_ and _across_ (via the imposition of equal thresholds, loadings, and scale factors) groups of a multigroup model.

The extension to the Mplus CFA with categorical indicators approach is less clear: would DE's only adjust for differences in error variances _within_ groups of a multigroup model (and in so doing, adjust the loadings and thresholds as well -- which are _still_ constrained to be equal across groups even though the scale factors are allowed to vary) ?

First of all, there is no difference between the two-parameter normal ogive IRT model and the Mplus model.

To summarize my view, there are two ways to capture DIF in Mplus modeling: (1) CFA with covariates and (2) multi-group analysis. To me, DIF means that for a given item you have different item characteristics curves for different subject groupings and both approaches capture this.

In approach (1), DIF for a certain item is handled by allowing a binary (say) x variable to have a direct influence on the item in question, thereby making its threshold (difficulty) parameter different for the two groups. Direct effects are only needed for items with DIF. Scale factors are not involved here.

In approach (2), considering the same two groups, DIF is handled by a two-group analysis where the DIF item is allowed to have different threshold and loading for the two groups. So this is a more general DIF form. Scale factors are fixed at 1 for this item (free scale factors are used for items that are assumed to not have DIF).

For further details, please see the Muthen articles given on this web site under References-Categorical Outcomes-IRT, especially papers 35, 18, and 15. These references also show the relationships between Mplus and IRT parameterization; see also the IRT paper by McIntosh listed on the home page.

I am trying to do a cross-validation of the model I've settled on for a particular set of all categorical data and was wondering how I would apply the same thresholds/loadings to the cross-validation sample to derive factor scores.

If a one factor model with categorical indicators can be transformed quite easily to yield a unidimensional logistic IRT model, what does this mean in the multidimensional case? Specifically, the majority of multidimensional IRT models that I have seen are compensatory in nature, with each dimension providing its own discrimination parameter to the item characteristic function (yielding an "a" vector) and a scalar difficulty/threshold (see Ackerman, 1994). If we extend the same procedures that are used to transform the loadings/discrimination and threshold parameters for the unidimensional case in Mplus to what we might expect to get if there were a BILOG/PARSCALE analog for multidimensional IRT, we end up with not only multiple "a" parameters as in the compensatory model, but also multiple thresholds. To me this suggests that the multidimensional model derived in Mplus would in effect be noncompensatory in nature. Is this in fact the case, or am I way off?

I was a bit unclear. In order to transform the Mplus derived parameter estimates to what would be expected with a logistic model we:

divide the Mplus loading by the square root of the quantity 1 - (Mplus loading squared) to get the analogous "a" parameter from BILOG/PARSCALE; and divide the Mplus derived threshold by the Mplus derived loading to get the analogous difficulty/threshold parameter.

For a one dimensional model this is fairly straightforward - yielding only one discrimination parameter and one set of thresholds per item.

For a multidimensional item, however, if you do the transformation of the thresholds for each dimension, you end up with as many sets of transformed thresholds as you have dimensions.

Is there another way to approach the transformation issue for the multidimensional case?

Good question. You are right about the division by the square root of the quantity 1 - (loading squared). Explicating what this quantity is makes the generalization to multiple factors clear. The quantity is the residual variance of y*, the continuous latent variable underlying the binary observed y. In the Mplus factor analysis with categorical y's, the y* variances are standardized to one. The loading squared inside the parentheses is the variance in y* explained by the factor when we have a single factor with variance one. In the more general case, the variance explained can be calculated in the usual way for continuous y's to include cases with several factors that may be correlated and/or have variances different from one. So, for instance, with two correlated factors, you have variance explained in item j equal to V(f_1)*lambda_j1**2 + V(f_2)*lambda_j2**2 + 2*lambda_j1*lambda_j2*cov(f_1,f_2).
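The two-correlated-factor formula above can be sketched directly; all the numeric inputs here are hypothetical, purely to show the arithmetic:

```python
def explained_variance(l1, l2, v1, v2, cov12):
    """Variance of y*_j accounted for by factors f1, f2 with loadings
    l1, l2, factor variances v1, v2, and factor covariance cov12."""
    return v1 * l1 ** 2 + v2 * l2 ** 2 + 2 * l1 * l2 * cov12

# Hypothetical loadings and factor covariance:
expl = explained_variance(l1=0.6, l2=0.4, v1=1.0, v2=1.0, cov12=0.3)
resid = 1.0 - expl  # residual variance of y*, since y* variance is 1 in Mplus
print(round(expl, 3), round(resid, 3))
```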

I'd like to return to the discussion of quantifying DIF for a scale. Several points were raised in the exchange between Rich Jones and Bengt back on April 6. I'd like to ask for a little more explication, if possible. The basic situation is one in which one runs a MIMIC model with DIF effects and a MIMIC model without such effects, to assess the overall magnitude of DIF.

1. Rich Jones proposed comparing standardized parameter estimates of the indirect effects for a MIMIC model with no DIF and a model that contained DIF effects (i.e., direct effects from the background variable to the item(s) in question). I assume that the standardized indirect effect would be the product of the (STD) effect of the background (x) variable on the factor times the (STD) loading of an indicator on the factor. My question: If one has several indicators, does it make sense to assess *overall* DIF for the set of indicators by looking at the *sum* of the loadings of the indicators times the effect of the x variable on the factor, and comparing these quantities in the DIF and no-DIF models?

2. Bengt also suggested that one could look at "standardized mean differences" for different groups, defined by the background factors (x variables). Perhaps it's the summer heat, but I'm not sure how one would calculate standardized mean differences from the Mplus output. Suppose one had dummy variables for each combination of background variables. Would one compare the magnitude of the (STD) effects of each dummy variable on the factor in the DIF and no-DIF models? Or does one calculate standardized mean differences in some other manner?

3. Bengt also suggested comparing factor scores in the DIF and no-DIF models. When comparing the mean factor scores for two groups differing in their background characteristics, would it make sense to incorporate some information on the within-group standard deviation of the estimated factor scores, to provide a scale for the factor score comparison (e.g., something analogous to a measure of effect size)?

4. Finally, suppose one had two background variables in the MIMIC model (e.g., age and gender). One could form dummy x variables for each age/gender combination. If one had 4 groups, one would use 3 dummy variables in the model. The question here is: How can one recover the predicted mean for the excluded group? In a standard regression, one would use the intercept, but there's no intercept in the Mplus formulation. Is there a way to use the overall predicted factor mean from the TECH 4 output for the purpose of recovering the mean of the excluded group? (And is there somewhere in the Appendices that provides the formula used to obtain the estimated factor mean in the TECH 4 output?)

These questions may have obvious answers; I'm just not seeing them right now. Thanks in advance for clarifying these points.

Thank you for the explanation of establishing the variance in an item explained by two correlated factors. In terms of the item thresholds, however, would we end up with multiple sets of thresholds per item (one set for each dimension), or is there a way to transform to one set?

Mplus has just one threshold for each item, even when there are several factors. This is because the threshold is defined on the scale of the latent response variable y*, and there is only one y*. In achievement testing, y* is the specific ability needed to solve a certain item correctly. This ability may consist of several factors.

The relationship between IRT parameterizations and Mplus is most clearly seen in P(y=1|f), the probability of y=1 given the factor (or factors) f, expressing this as a normal distribution function with argument arg, say. For a single theta in IRT,

arg = a(theta - b)

whereas in Mplus with a single factor f,

arg = (-tau + lambda*f)*c,

where the inverse of c is the square root of the residual variance that we discussed earlier. This gives the relationship

a = lambda*c, b = tau/lambda.

With several factors in Mplus,

arg = (-tau + lambda_1*f_1 + lambda_2*f_2 + ...)*c.

This formula can then be used to derive the relationship to the various IRT formulations. Hope this helps.
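As a sketch of one such derivation (an assumption-laden illustration, not Mplus output): the multi-factor arg maps onto the compensatory MIRT form arg = d + a_1*theta_1 + a_2*theta_2 + ..., with one discrimination per factor but a single scaled intercept d = -tau*c per item. In the one-factor case this reduces to the a = lambda*c, b = tau/lambda relationship above:

```python
import math

def mplus_to_mirt(loadings, tau, factor_cov):
    """Hypothetical sketch: scale Mplus probit loadings and the single
    threshold tau to a compensatory MIRT form arg = d + sum_i a_i*theta_i,
    where c = 1/sqrt(residual variance of y*) and y* has variance 1."""
    k = len(loadings)
    explained = sum(loadings[i] * factor_cov[i][j] * loadings[j]
                    for i in range(k) for j in range(k))
    c = 1.0 / math.sqrt(1.0 - explained)
    a = [lam * c for lam in loadings]  # one discrimination per factor
    d = -tau * c                       # a single scaled intercept per item
    return a, d

# Hypothetical two-factor item:
a, d = mplus_to_mirt([0.6, 0.4], tau=0.5, factor_cov=[[1.0, 0.3], [0.3, 1.0]])
print([round(x, 3) for x in a], round(d, 3))
```

With one factor, -d/a[0] recovers the usual b = tau/lambda difficulty.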

Here are some thoughts on John Fleishman's questions. Rich Jones may have further ideas.

1. It would be nice to have some overall indicator of the DIF, for all items involved. Perhaps the sum of the squared differences between the standardized indirect effects with and without direct effects. Perhaps root mean square.

2. One can get the factor means and factor standard deviations from TECH4, then compute mean differences divided by sd.

3. Sounds alright.

4. The estimated mean for the excluded group would be zero, since a single-group analysis fixes the intercept for the factor at zero (in line with the mean fixed at zero). TECH4 would only give the marginal factor mean (so averaged over all background variable values). In line with equation (38) in version 2's appendix, the factor mean conditional on x covariates is the factor intercept plus Gamma*x; with the intercept fixed at zero this reduces to Gamma*x, so the dummy-variable slopes give the other groups' means relative to the excluded group's mean of zero.
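Point 1's root-mean-square suggestion might be sketched as follows (using the 0.490/0.498 standardized indirect effects from the earlier example as a one-item illustration; with several items the vectors would be longer):

```python
import math

def rms_dif(std_with_dif, std_without_dif):
    """Root-mean-square of the differences between standardized indirect
    effects from models with and without direct (DIF) effects."""
    diffs = [a - b for a, b in zip(std_with_dif, std_without_dif)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

# One-item case: reduces to the absolute difference
print(rms_dif([0.490], [0.498]))
```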

The R-square values that are printed if you request a standardized solution would serve this purpose. The R-square describes the proportion of variation in the latent response variable y* accounted for by the multiple factors influencing this y*. See Appendix 1 of the User's Guide for more details about y*.

In response to John Fleishman's request for an explication of my Single-Number Summary of DIF...

Direct and Indirect Effects in Mplus MIMIC models

In the case of a single-factor MIMIC model (single factor CFA with covariates), what I refer to as the indirect effect can also be conceptualized as a regression of the latent factor (eta) on a given x. When the x is dichotomous, these regressions are analogous to dummy variable regressions in ordinary linear regression or mean differences as expressed in ANCOVA models. So while the parameter describes the indirect relationship between the x (e.g., group membership) and the item(s), it also captures group mean differences in the underlying construct (eta) (this is a powerful feature of DIF detection with a MIMIC model that is difficult to get at with more usual DIF detection approaches).

Mplus will produce regression parameters standardized with respect to the variance of eta (STD), and also standardized to the variances of eta, the x's, and the y's (STDYX). The indirect effect, when standardized with respect to the variance of eta (STD; STDYX is harder to interpret in the case of dichotomous x's), describes a kind of effect size difference for group membership: that is, the standard deviation increase in eta associated with a unit increase in the covariate (i.e. group membership).

I am very interested in the study of DIF, but I also believe that DIF is often at best a nuisance. What I am really interested in is how cognition, depression, functioning, whatever, is distributed within a sample, across groups, or how it relates to some other characteristic of subjects. The presence of DIF might lead to spurious inferences of group differences or exaggerated relation to other correlates if the DIF is of significant magnitude. Most studies of DIF that I have seen conduct an analysis of DIF, interpret (some) of the findings, and stop there without going on to explore how the DIF might impact findings or relationships with other variables. Often, DIF studies only interpret evidence of bias in the direction favored/suspected a priori by the investigator. I was just trying to go one step further and try to conceptualize/express the overall magnitude of DIF and the importance of modeling DIF or, conversely, the cost of ignoring DIF.

Overall Summary of DIF

In my posted example using the single-group one factor MIMIC approach, I built a MIMIC model in a forward stepwise fashion, examining model fit derivatives for evidence of DIF and sequentially freeing up direct effects etc., as described in Muthén, B., in Test Validity, H. Wainer & H. Braun, eds. (1988). The initial model (without any DIF/direct effects) suggested significant and large group differences in the underlying construct (eta) - in other words, the regression of eta on x was large and significant: a significant indirect effect. The final model, one that included DIF according to group membership, suggested many items with DIF (a.k.a. direct effects), and group differences in the underlying construct remained (indirect effect).

I was interested in trying to describe how much of the observed group difference in eta was due to bias (DIF) by group membership. As in other areas of statistical inquiry, large samples may lead to an analytic finding of statistically significant DIF, but the magnitude of the effect is of little practical importance. Also, sometimes you'll find DIF favoring one group on some items, and suggesting a disadvantage on other items, and given different difficulty and perhaps even discrimination across indicators, it's hard to gauge the overall importance of DIF.

The approach I took was to compare from the final model (with DIF) the STD indirect effect (regressions of eta on group membership) for the group membership dummy covariate (x) to a model otherwise equivalent with the exception of the direct effects (DIF) - a purposefully mis-specified model. You can get an omnibus chi-square difference test and p-value for all DIF parameters this way, but you already know this will be significant given the way the MIMIC model was built. What I was trying to do was get a handle on how large the group differences would be if the DIF was ignored.

In my posted example, the final model returned a STD indirect effect of 0.490. That is, the standardized (with respect to the variance of eta) difference in eta was 0.490. This value was slightly lower than the standardized group difference in eta found for the purposefully mis-specified model: 0.498. I further expressed the discrepancy in group difference estimates as a fraction of the mis-specified group difference ((0.498-0.490)/0.498)=1.6%. This result is very interesting because although my analysis demonstrated significant DIF, the practical importance of this DIF in terms of obtaining an un-biased estimate of eta seems to be relatively small. Therefore, I concluded that most of the group differences in eta were not due to possible item bias (but may be due to constant bias - but that's another topic, a matter of substantive interpretation of the indirect effect).
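The change-in-estimate calculation described above is just a relative discrepancy, sketched here with the numbers from the post:

```python
def change_in_estimate(full, misspecified):
    """Relative discrepancy between the mis-specified (no-DIF) and the fitted
    (with-DIF) estimates of the standardized group difference in eta."""
    return (misspecified - full) / misspecified

pct = change_in_estimate(full=0.490, misspecified=0.498)
print(f"{pct:.1%}")  # → 1.6%
```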

So I use the figure 1.6% as a single number to quantify the discrepancy between DIF and no-DIF models in terms of the underlying construct. Bengt suggested other ways, for example estimating factor scores for the final and mis-specified models and plotting them, computing their correlation, or estimating the mean of the differences, etc.

Other DIF Summaries

I have considered other summaries, for example something analogous to a sum of the area between the item characteristic curves (ICCs) for focal and referent groups. Other postings on the Mplus discussion list describe how MIMIC model parameters can be used to obtain IRT parameters, and Raju (1988; Psychometrika 53:495-502) demonstrated that the area between two groups' ICCs is equal to their difference in IRT difficulty parameters for 1-P and 2-P models. So you could convert direct effects, thresholds and loadings to difficulty parameters, and compute the sum of group differences in item difficulty across items as another single-number expression of the total effect of DIF (but note that direct effects are the only things that vary across groups in a single-group MIMIC model).
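One way this area summary might be sketched, under the assumption that discriminations are equal across groups (as in a single-group MIMIC model, where only the direct effects vary) and that a direct effect kappa shifts an item's difficulty by roughly kappa/lambda; all inputs are hypothetical:

```python
def total_dif_area(direct_effects, loadings):
    """Hypothetical sketch: with equal discriminations across groups, Raju's
    area between focal- and referent-group ICCs equals the difficulty
    difference, approximated here as kappa/lambda for each item's direct
    effect kappa and loading lambda. Summing the absolute areas gives one
    overall DIF number."""
    return sum(abs(k / l) for k, l in zip(direct_effects, loadings))

# Three hypothetical items, the third with no DIF:
print(total_dif_area(direct_effects=[0.20, -0.15, 0.0], loadings=[0.8, 0.6, 0.7]))
```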

Limitations of this area approach are that it does not take into consideration the distribution of ability in the sample of interest. If the items are highly skewed -- all very difficult or very easy (as they often are in fields outside of educational testing, such as psychology, medicine, and epidemiology) -- this sum of areas may misrepresent group differences. It's possible that the sum of the areas between the curves is very large, but if weighted by the distribution of ability in the population, and with all of the items most discriminating at the tails of the ability distribution, there will be very small differences in estimated group differences due to DIF. I believe this is what is happening in my example, where large and significant DIF explains little of the overall group difference in estimated ability, because very few respondents have a level of ability that matches the difficulty level of the test items.

When estimating the correlation between two continuous latent variables with binary observed variables, does Mplus incorporate a correction for attenuation due to unreliability (like a factor analysis model does for continuous observed variables)? Or does it compute correlations between latent variable estimates (as I think is the case for IRT programs like BILOG)?

I'm trying to develop an IRT-style scale for several polytomous items that are measured longitudinally. I'm somewhat surprised by some of my results, so I'm hoping you can check my logic. I have too many items/response categories to do a full longitudinal CFA (MPlus insists there's not enough memory in the workspace, no matter how much I try to give it). So I did a one-factor CFA with one year's data, using WLSMV, and was satisfied with the outcome. I'm willing to make the assumption that the items have the same measurement relations to the factor across years.

I then ran the model for different years, fixing all of the loading and threshold values to be equal to those from the above CFA, and saved factor scores. My thinking was that this would give me a score for each year, all on the same scale. However, the factor means I'm getting out of these runs are surprising to me. It may be that they're right and my expectations were simply wrong.

However, I noticed a pattern. Some of the years don't have all of the items, so I omitted those items from the scoring runs for those years. With all of the thresholds and loadings fixed, I'm working from the idea that that's as if those variables are missing (completely at random, in fact), and shouldn't affect the scaling. But the years with missing items all have the highest factor scores.

Off-hand, I don't see that this approach has any gross error, although I may be overlooking some scaling issue. As a check of reasonableness of the results I would compare this to treating the items as continuous and study the mean development for the average at each time point (average to take into account missing items).
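The suggested reasonableness check, averaging over only the items actually administered each year, might look like this (hypothetical data, with None marking an item missing that year):

```python
def item_average(responses):
    """Mean over administered items; None marks an item missing that year."""
    observed = [r for r in responses if r is not None]
    return sum(observed) / len(observed)

# Hypothetical yearly data: year 2 lacks the last two items.
year1 = [3, 2, 4, 1, 2]
year2 = [3, 3, 4, None, None]
print(item_average(year1), item_average(year2))
```

Comparing these per-year averages against the factor means from the scoring runs can flag a scaling problem like the one described in the follow-up below.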

Thanks. I discovered the key problem was in my assumption that the scaling was constant across years -- there were certain items that were phrased differently in different years, making them much "easier items" in some years than others. I appreciate the check on the logic.

Just starting to get into measurement equivalence assessment using two group CFA. Based on reading so far, I think that when using Mplus on binary math item indicators if I have partial measurement invariance between M and F on Math it is reasonable to assume that the factor means and factor scores can be estimated and that I am measuring the same factor in both groups. True so far? But I am interested in these factor scores compared to 2P IRT derived scores. I have seen on this board the formulas for converting Mplus parameters to BILOG parameters. I am wondering if the Mplus factor scores will correlate with other variables in the same way as BILOG parameter based scores. Maybe an obvious answer, but a stretch for me. Likewise would differences in group means relative to variances be the same for Mplus factor scores and BILOG scores?

Mplus Web Note #4 goes through different parameterizations including IRT and shows that the relationship between the item and the factor is the same. This means that the factor scores are the same. The only difference between BILOG and current Mplus is the difference due to BILOG using a logistic function and Mplus using a probit function; this should produce only minor differences wrt factor scores.

Thanks for your very helpful response. Is it correct that whether you have partial measurement invariance is somewhat subjective? If you have it, or enough of it, am I correctly interpreting your comments on the board that you can estimate factor scores for the two groups and DIF is controlled? This is extremely helpful. Thanks.

RE: Rich Jones on Friday, April 06, 2001 - 11:11 am: suggestion on an effect size for DIF. Would I be looking at the more general DIF by using a two group analysis? Would I then compare the factor mean for the withDIF group computed from a model recognizing some noninvariance (but still retaining partial measurement invariance) with the factor mean for the withDIF group computed from a model assuming invariance? Correct? I guess there is no scale problem as long as I use the same item as reference indicator throughout, even though I free up some parameters? I wonder if there is any way to get a standard error on that difference, or an upper bound on the standard error?

Clarification of my question. My term withDIF group makes sense in my context. That is, a test is administered and scaled under standard conditions and then administered to others under nonstandard conditions. So this last group is the withDIF group I was referring to. Upon reflection, my issue may be yet more complex. Getting the scores of the withDIF group controlling for DIF is no problem, I think. However, the comparison would be with their scores if they were scored using the item parameters derived from the standard group only. Nothing in my described analysis gives me that?

Yes, when moving from a mimic-type analysis to a 2-group analysis I think comparing the factor mean from the correctly specified model (with the non-invariance in question included in the model) with the factor mean from the incorrectly specified model makes sense. I don't know about the s.e. of the difference in means. In a sense the s.e. for the mean for the correctly specified model would be somewhat useful - for example, it is informative if the incorrect mean is more than 2 s.e.'s away from the correct mean.

In Re Magnitude of DIF: Comment, a shameless plug, and a proposed rule of thumb for interpreting magnitude of difference

I like to think of the parameter estimate for the indirect effect (or mean difference in the multiple-group case) and its associated standard error as a test of significance, and the comparison of parameters from fitted and mis-specified models as a kind of effect size measure.

I've recently learned that Maldonado and Greenland (1993, Am J Epidemiol 138:923-936) describe a simulation study used to evaluate different strategies for identifying important confounders in observational studies. Their strategy might be adapted for evaluating the model mis-specification approach to detecting "confounding" in the ability estimate due to DIF. These authors ultimately recommend a "change in estimate" approach similar to what I propose in Jones & Gallo (2002), but using a pre-determined threshold (e.g., |b-b'|/b' > 0.10, where b and b' are parameter estimates from mis-specified and fitted models, respectively) as a criterion for marking 'important' confounding.

This 10% difference rule of thumb might be as good a rule of thumb as any, short of simulation studies or other indications that the detected DIF makes an important or substantial impact. BTW: I should mention that I came to the Maldonado and Greenland work by way of Crane, Jolley and Van Belle, who use it as a criterion in assessing the presence of uniform DIF in their DIFdetect procedure (see http://www.alz.washington.edu/DIFDETECT/welcome.html).

However, I can see that it would be nice to have some indication as to how confident we can be that the difference between fitted and mis-specified parameter estimates for mean difference in underlying ability is less than 10%.

Setup: I am trying to use the Mplus factor score estimator to produce equivalent latent trait estimates for two sequential administrations of the same symptom inventory. I need to generate equated, or linked, scores because the wording of response options (but not symptom stems) changed between administrations. All items are treated as dichotomous (symptom present/absent).

Approach: I am linking the two models by (1) estimating factor loadings and thresholds in the first administration, with the variance of the single latent factor fixed at 1 and its mean at 0, and saving factor scores; and (2) estimating latent trait estimates (factor scores) at the second administration, constraining the loadings for all items, and the thresholds for the items that are (assumed to be) equal across administrations, to be equal to those estimated at the first administration. I've estimated two separate models, so that by default Delta = I in both administrations. Further, in the second model, the only free parameters are the mean of the latent trait, the thresholds for the items for which the wording changed, and ...

Questions: (1) Should I hold the variances of the latent factor to be equal (i.e., 1) at the second administration?

(2) Do you think this would be more appropriately parameterized as a multiple-group model, bringing Delta into the picture? (i.e., fix Psi to 1 and freely estimate Delta for group 2, where group 2 is really administration 2?)

I realize that if all items were exactly the same and I did not constrain the variances to be equal (along with all loadings and thresholds), the metric of the latent trait would change, and I would get a different latent trait estimate for identical response patterns. I'm just not sure whether I can expect (assume) the latent trait variance would/should be equal when the thresholds for the items are very different.

...but my idea was to treat them as separate groups in a scale-equating phase of the analysis, and then look at longitudinal changes in a separate set of analyses. I realize this is not necessary with Mplus, but a secondary goal is to provide a set of equated scores for other investigators to use (who might not use Mplus).

You can do this by a "longitudinal factor analysis", that is, a single-group analysis with one factor per time point (I would not use a multiple-group approach since you have the same people at the two administrations, so not independent samples). The standard setup is to hold thresholds and loadings invariant across time to the extent that is realistic; let the factor variances differ across time (with one loading fixed at 1 to set the metric); fix the factor mean to zero at time 1 and free it at time 2; and set Delta = 1 for time 1 and free for time 2 (see Mplus Web Note #4). Then estimate factor scores from this model.

Or, do the above to get thresholds and loadings for each administration, then run each administration separately with parameters held fixed at the solution from the joint analysis, and estimate factor scores for that administration. This approach is perhaps somewhat less prone to misspecification of the across-time relationships.

Thanks for the suggestion. I ran these models and I am sure there is something I still do not 'get' about the use of scale factors. I will re-read Web Note #4 more carefully.

I actually have more than two administrations of this questionnaire: only one of the administrations differs from the others and needs to be linked. Running each repeated administration separately, I find that if I do not constrain both the Psi and Delta matrices, the estimated factor scores for identical response patterns are not equal across time (even when the items are the same and lambda and tau are constrained to be equal). I find this sample-invariant scoring intuitively pleasing, and it seems to be consistent with the IRT model.

Having different Psi matrices (and therefore different Delta) influences the factor score estimation even when tau and lambda are the same. Psi is the "prior" factor covariance matrix and therefore should have an influence. Substantively, it seems that Psi can change over time, and this should be allowed even when tau and lambda remain invariant.

I am doing two group CFA with one factor. The indicators consist of some binary items and some ordered polytomous items. If I wanted to refer to the Mplus factor scores in IRT terms, would I say they are similar to or the same as IRT graded response scores? Thanks.

So if the items are all ordered polytomous, then it's like the graded response model? If all or some items are binary, then it's the 2-parameter normal ogive with MAP scoring? In that case, are item scores other than 0 (e.g., 1, 2, 3) treated as 1 to compute the probit coefficients?

I am estimating a graded response model (Samejima, 1969) in Mplus. The structure of the scale turns out to be multidimensional (4 factors). I wonder whether in this case I can still use the same transformation, i.e., a = loading/sqrt(1 - loading**2) and b = threshold/loading, to convert the Mplus threshold estimates for an item into the IRT model's b, and likewise for a?

Unfortunately, I do have three items that load on two factors. Therefore, shall I use: b1 = threshold1 / sqrt(1 - (var(f1)*lambda_f1**2 + var(f2)*lambda_f2**2 + 2*lambda_f1*lambda_f2*cov(f1, f2))) as the transformation for these three items?

A correction to my last inquiry: for the three items that load on two factors, a_f1 = lambda_f1 / sqrt(1 - (var(f1)*lambda_f1**2 + var(f2)*lambda_f2**2 + 2*lambda_f1*lambda_f2*cov(f1, f2))) and a_f2 = lambda_f2 / sqrt(1 - (var(f1)*lambda_f1**2 + var(f2)*lambda_f2**2 + 2*lambda_f1*lambda_f2*cov(f1, f2))). As such, each of these three items will have two sets of transformed thresholds: one set for factor 1, b1_f1 = threshold1_f1/lambda_f1, ..., and one set for factor 2, b1_f2 = threshold1_f2/lambda_f2. Is this correct?
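For reference, the proposed two-factor transformation can be written out in Python (a sketch of the formulas in these posts, with made-up illustrative values; not verified against Mplus output):

```python
import math

def irt_conversion_2factor(lam1, lam2, var1, var2, cov12, thresholds):
    """Two-factor FA-to-IRT transformation as proposed in the posts above.
    The residual SD of the underlying response variable is
    sqrt(1 - (var1*lam1^2 + var2*lam2^2 + 2*lam1*lam2*cov12))."""
    explained = var1 * lam1**2 + var2 * lam2**2 + 2 * lam1 * lam2 * cov12
    resid_sd = math.sqrt(1 - explained)
    a1 = lam1 / resid_sd                 # discrimination w.r.t. factor 1
    a2 = lam2 / resid_sd                 # discrimination w.r.t. factor 2
    b1 = [t / lam1 for t in thresholds]  # difficulties w.r.t. factor 1
    b2 = [t / lam2 for t in thresholds]  # difficulties w.r.t. factor 2
    return a1, a2, b1, b2

# Illustrative values (unit factor variances, correlated factors):
print(irt_conversion_2factor(0.6, 0.5, 1.0, 1.0, 0.3, [-0.4, 0.2, 0.9]))
```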

First, am I correct in saying that Mplus only models the means, variances and covariances, and not the higher-order moments? In that case, the estimation of IRT models via Mplus is not full-information, but a good approximation instead.

Second, is it still not possible to estimate the rating scale model or the partial credit model (with a probit link instead of a logit link) in Mplus? I ask because the capabilities of Mplus are growing rapidly, and the previous question about this topic dates from 1999.

The current version of Mplus uses weighted least squares with a probit link. This is not a full-information estimator. Version 3 will include a full-information maximum likelihood estimator with a logit link for categorical outcomes.

I am not familiar with what the rating scale model or the partial credit model is. If you can explain it in a simple way, I can try to answer this.

The rating scale model is a model for ordered polytomous data. It states that: ln[P_vij / P_vi(j-1)] = Theta_v - Beta_i + Tau_j, with v = person, i = item, j = category; Theta is a person parameter, Beta an item parameter, and Tau a category parameter. As such, this model assumes the same spacing between categories for all items.

The partial credit model is an extension of the rating scale model in that it relaxes the assumption of equal category spacing across items: ln[P_vij / P_vi(j-1)] = Theta_v - Delta_ij, with Delta an item- and category-specific parameter.
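The two adjacent-category models described here can be sketched numerically; in the snippet below the first function is the rating scale model and the second the partial credit model, with parameter names of my own choosing:

```python
import math

def rsm_probs(theta, beta, taus):
    """Rating scale model category probabilities from the adjacent-category
    logits ln(P_j / P_(j-1)) = theta - beta + tau_j."""
    logits = [0.0]
    for tau in taus:
        logits.append(logits[-1] + theta - beta + tau)
    expl = [math.exp(l) for l in logits]
    total = sum(expl)
    return [e / total for e in expl]

def pcm_probs(theta, deltas):
    """Partial credit model: ln(P_j / P_(j-1)) = theta - delta_j, with
    item- and category-specific steps delta_1..delta_m."""
    logits = [0.0]
    for d in deltas:
        logits.append(logits[-1] + theta - d)
    expl = [math.exp(l) for l in logits]
    total = sum(expl)
    return [e / total for e in expl]

# The RSM is the special case of the PCM with delta_j = beta - tau_j:
print(rsm_probs(0.5, 1.0, [-0.5, 0.5]))
print(pcm_probs(0.5, [1.0 - (-0.5), 1.0 - 0.5]))  # same probabilities
```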

With ML estimation, Mplus uses a logistic regression of the ordered polytomous item on the factor, where the logistic regression model is a proportional-odds model in the language of Agresti's categorical data book. I am told by IRT experts that this is Samejima's model. I haven't seen this spelled out in writing, but you can check Agresti and compare to Samejima.
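The proportional-odds/graded-response correspondence can be illustrated numerically; this sketch uses a logit link and IRT-style a and b parameters (names are illustrative, not Mplus output labels):

```python
import math

def graded_response_probs(theta, a, bs):
    """Graded response / proportional-odds sketch with a logit link:
    P(Y >= k) = 1 / (1 + exp(-a*(theta - b_k))); category probabilities
    are differences of adjacent cumulative curves."""
    cum = [1.0] + [1 / (1 + math.exp(-a * (theta - b))) for b in bs] + [0.0]
    return [cum[k] - cum[k + 1] for k in range(len(bs) + 1)]

probs = graded_response_probs(theta=0.0, a=1.5, bs=[-1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])  # one probability per ordered category
print(round(sum(probs), 6))          # the probabilities sum to 1
```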

Just a follow-up to the graded response question. I've compared estimates from MULTILOG and Mplus in three different data sets and they are essentially in agreement, as follows: MULTILOG's A = Mplus' standardized loading; MULTILOG's B(k) = Mplus' k-th threshold divided by the standardized loading. The relationship for A looks different from what has been stated in some earlier questions/answers: is it different?

Re: Mplus/MULTILOG. I don't think I've understood the answer to "Anonymous Tuesday, December 16, 2003 - 10:07 am", which describes a different transformation from Mplus to graded response parameters. But to be clear: a 1-factor model for polytomous variables in Mplus fits the equivalent of a logit-form proportional-odds model? And since MULTILOG fits a logit version of the graded response model, the simple relationship above holds? (And can/can't Mplus do normal-ogive versions?)

I have a question about generating the scaling factor for an IRT analysis. My items are categorical - with 4 response options and I have 15 total items. How do I generate a scaling factor for each of the 15 items?

I have read through this discussion and I think I've figured out how to generate the a and b parameters - but does the newest version of Mplus allow for graphing of the IRT curves? If so, how do I do that?

A quick question. Reading through all the IRT conversations, I am left needing some clarification. Mplus will perform Samejima's model, similarly to MULTILOG? However, Mplus Version 3 does not perform IRT for rating scale data? Am I correct on these points?

By rating scale, I mean a scale that runs strongly agree, agree, disagree, strongly disagree. I know that Mplus can handle this type of data in other frameworks (i.e., SEM), but will it work for an IRT model using the Likert-type format?

I have noticed that person abilities estimated by the MLR method in Mplus are continuous while I expected discrete values like MLE or WLE ability estimates. What causes this difference in the MLR method? Do you have a reference where I can find information about this? Many thanks in advance.

My understanding is that if the model is a logit model - and, with constrained loadings, a Rasch model - then it is in the exponential family and the student raw scores are sufficient statistics. Therefore there should be a one-to-one match between abilities (factor scores) and raw scores, but that is not happening in Mplus.

I am using MLR with dichotomous items and I am constraining the item loadings to 1. My understanding is that this will result in a Rasch model with a normal prior. My understanding is also that in this case the raw scores and the ability estimates will have a one-to-one match; the metrics will be different and the transformation from one to the other will be non-linear, but nevertheless there should be just one estimate for each possible raw score. This does not appear to be happening and I am not sure why. Do you have an explanation? What do you recommend as the best reference for understanding the MLR estimation algorithm in this context?

To answer this I would need to see your Mplus output and your data. Please send them to support@statmodel.com.

Just to make sure: you should hold the loadings equal to each other when fixing the factor variance to one, not fix the factor loadings to one. If you fix the factor loadings to one, you should allow the factor variance to be free.
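The sufficiency claim can be checked numerically: with equal discriminations, the posterior for theta depends on the response pattern only through the raw score, so EAP scores coincide for patterns with the same sum. A minimal grid-integration sketch with made-up parameters (not the Mplus algorithm):

```python
import math

def eap(pattern, a, bs, n_grid=201):
    """EAP score under a 2PL with a common discrimination a (a Rasch-type
    model with a standard normal prior), via grid-based integration."""
    grid = [-5 + 10 * i / (n_grid - 1) for i in range(n_grid)]
    num = den = 0.0
    for theta in grid:
        w = math.exp(-0.5 * theta**2)  # normal prior (up to a constant)
        for u, b in zip(pattern, bs):
            p = 1 / (1 + math.exp(-a * (theta - b)))
            w *= p if u == 1 else 1 - p
        num += theta * w
        den += w
    return num / den

bs = [-1.0, -0.3, 0.4, 1.1]  # made-up item difficulties
# Two different patterns with the same raw score (2 of 4) give the same EAP
# when discriminations are equal, because the pattern-specific likelihood
# factor is constant in theta and cancels in the posterior:
print(round(eap([1, 1, 0, 0], 1.0, bs), 6))
print(round(eap([0, 0, 1, 1], 1.0, bs), 6))
```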

I am wondering how to incorporate estimates from one item-response model into a second item-response model and still get good standard errors. More specifically:

“Stage 1” is a multilevel item-response model for individuals’ ordinal responses to questions at time 1. Level 1 = person and level 2 = items within person. This will give coefficient estimates and cut-points from which we can obtain probabilities of an individual scoring between any two cut-points.

“Stage 2” would be a multilevel item-response model for individuals’ responses to questions at time 2, but this time we would like to incorporate estimates from “stage 1” as predictors.

Any suggestions on how to model this in Mplus? Thank you so much for your help. Laura Piersol

To add to this discussion, it sounds like your "cutpoints" are thresholds for ordinal outcomes and perhaps your "coefficient estimates" are the loadings (discriminations). If so, Linda's 2-factor suggestion refers to a longitudinal factor analysis with 1 factor at each time point. Instead of having stage 1 estimates as "predictor", the idea is then to hold the thresholds and loadings equal across time. Perhaps this is something you want to do - assuming we have understood you correctly.

I'm a new user, so I'm still learning the program. I'm trying to generate an IRT analysis of my data (1 latent trait explaining 10 categorical variables) modeling my program on example 5.5 from the manual.

Can Mplus generate Item Characteristic Curves using the Plot command? (The program gives me the options only of Histograms, Scatterplots, Sample Proportions, and Estimated Probabilities, none of which produce ICCs.)

Thank you for your 3/8/05 responses to my question. I have been trying to get a better handle on these types of models, including factor analysis and IRT as I am relatively new to these topics.

In regard to holding the thresholds and loadings equal across time: we are following a group over two time points. We are happy to assume that the thresholds are equal across items at each time point. However, the sets of items at the two time points are not identical, so equating thresholds doesn't seem correct. In this case, would you think of the first factor as a "predictor" of the second factor? This is the model I have in mind: MODEL: f1 BY u1-u7; f2 BY u8-u14; f1 ON x1-x10; f2 ON f1 x11-x20; Am I missing something?

I also would like to confirm that we don't need to run a multilevel model. In terms of variables, we have individuals' responses to the items and individual-level covariates. From reading the Mplus documentation, it sounds like this can be handled as a single-level model.

Do you have a literature suggestion for better understanding the intricacies of these models and interpreting the output?

If I understand you correctly now, you do not have the same items at two timepoints but different items at two time points representing two different dimensions. Then there is no need to hold thresholds and factor loadings equal. You would want to hold them equal if you have the same items at two time points representing the same dimension. The equalities specify measurement invariance. I think your MODEL command looks good given what you want to do.

If you have no clustering in your data, that is, children were not sampled from classrooms, for example, then a single-level analysis is appropriate.

I don't know of any one piece of literature, but a good SEM book would probably help. See our website, where there is a plethora of references. I think many like the Bollen book. Maybe someone else can make a suggestion. There are also some papers that compare IRT and SEM.

On Wednesday, March 09, 2005 - 11:48 am, Anonymous asked about plotting ICCs in Mplus. While I haven't found a way of doing it within Mplus, it is relatively easy to plot ICCs in your favourite graphing program (I like gnuplot, but I've done it in Excel as well). In gnuplot you can use the norm() function with the reparameterised estimates (see above on this page for how to reparameterise) and plot y = norm(a*(x-b)).
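For those without gnuplot, the same curve can be tabulated in a few lines of Python (the a and b values here are made up):

```python
import math

def normal_ogive_icc(theta, a, b):
    """Python equivalent of gnuplot's norm(a*(x - b)): the standard normal
    CDF evaluated via the error function."""
    return 0.5 * (1 + math.erf(a * (theta - b) / math.sqrt(2)))

# Tabulate an ICC for illustrative reparameterised estimates (a=1.2, b=0.5);
# the (theta, P) pairs can be pasted into any plotting program.
for theta in [-3, -2, -1, 0, 1, 2, 3]:
    print(theta, round(normal_ogive_icc(theta, 1.2, 0.5), 3))
```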

Some time back I asked about plotting test information functions. For the benefit of others here is what I've found. Hambleton & Swaminathan (1985) got me off to a good start but Frank Baker's online book on IRT had all the answers I was looking for.

The essential points are

1. The test information function is the sum of the item information functions

2. The item information function for the 2 parameter logistic model is

I(theta) = a^2 P(theta) Q(theta)

where P(theta) = 1/(1 + exp(-a(theta - b))) and

Q(theta)= 1 - P(theta)

a and b being the discrimination and difficulty parameters, and theta the latent "ability"

(see Baker, 2001 eqn 6.3 on p 106)

Note that for simplicity I've left the i subscript off these formulae.

I'm yet to find the item information function for the two parameter probit model.

Thanks again for the excellent software.

Andrew Baillie andrew.baillie at mq.edu.au

References

Baker, Frank (2001). The Basics of Item Response Theory. ERIC Clearinghouse on Assessment and Evaluation, University of Maryland, College Park, MD. http://edres.org/irt/baker/

You should work with the unstandardized values. But note that with WLSMV you have probit, not logit results. This means that you have to involve the factor 1.7 in your slope/loading comparison with BILOG.

Here is a webpage with a number of references -http://www.psychology.gatech.edu/unfolding/Publications.html. I'm not sure if these references all apply to unidimensional unfolding or not but I believe it has been applied multidimensionally. I am just starting out in this literature, so I may run across more recent articles. If so, I will pass them along.

If the Mplus framework can incorporate these types of models, it would represent a significant advance in the field of MDS, allowing MDS and unfolding solutions to be more rigorously tested and incorporated into more general latent variable models. I believe that the possibility of state-of-the-art missing data handling and the use of complex sampling designs would be a major advance, as well.

For ML estimation of single-factor models for binary indicators, Mplus Version 4 gives results not only in the regular factor analysis metric but also in the metric of the classic 2PL model with difficulty and discrimination estimates. The classic IRT estimates are given in the usual (0,1) metric for the factor.

The relationship between the factor model and IRT parameterizations is given in Day 3 of our course handouts. This will also be posted this week as part of the new Web Note #10. The factor model estimates are reported both raw and standardized, and the choice - or reporting both - is up to the user.

For the problem you are having with your 3-factor model, to make a diagnosis we need you to send input, output, data and license number to support@statmodel.com.

The only relevant IRT references I know are listed on our web site under References, Categorical outcomes, IRT - see also the forthcoming Web Note #10.

Last year I submitted a scale validation manuscript to a journal. The centerpiece of this manuscript was a confirmatory factor analysis conducted in Mplus 3 using WLSMV estimation. The input items were 136 binary personality inventory items. Previous research using PCA and EFA methods suggested three second-order factors and 17 first-order factors, so we fit that hypothesized factor structure to our data. We have some 6,000 research participants, whom we randomly split into two samples: an initial model validation sample (which we used to obtain a "brief" 48-item pared-down version of the original instrument) and a cross-validation sample on which we successfully refit the factor structure from the first sample.

A reviewer of the manuscript has called into question our use of factor analytic methods, arguing that we should instead use IRT methodology. The reviewer states, "To reduce the number of items measuring the three clinical dimensions of the 136-item inventory should be performed according to modern psychometrics outside the frame of factor analysis, namely with item response theory models. In this respect, the authors should consult, e.g., Borsboom, D: Measuring the Mind. Conceptual issues in contemporary psychometrics. Cambridge University Press 2005".

I have read the Borsboom book as well as an earlier paper he published in 2003 that delineates some of the philosophical conundra involved in using latent variable models to infer the presence of latent factors from correlation matrices. Given how enthusiastically the reviewer endorsed IRT over CFA, I was initially surprised to see that Borsboom's criticisms seem to apply with equal force to IRT and CFA/SEM models. Paul Barrett made mention of this on SEMNET in March of 2005, citing the following paper:

When I read through the SEMNET archives and this discussion board, as well as the helpful posts in the new IRT sections of the Mplus Web site, I found myself less surprised, given how closely related the two methods appear to be, with identical results possible under some conditions (e.g., ML estimation of models containing a single latent factor).

In crafting my response to this reviewer's comments, it would be helpful for me to know the scope of available IRT models in general and what is available in Mplus. First, to your knowledge, is it even possible to fit higher-order factor models within the IRT framework? If it isn't, then clearly IRT would not be a suitable tool for our purposes given that our theory clearly stipulates a higher-order factor structure a priori. On the other hand, if it is in fact possible to fit higher-order latent variable models under the IRT umbrella, is it feasible to do it using Mplus? I'd guess that if one of the requirements is ML estimation, then the answer is probably "No" because of the computational burden involved with this many variables and subjects.

Finally, given how closely related the factor analytic and IRT approaches are, even if it is conceptually possible to fit a higher-order IRT model and it's computationally feasible, is it even worth bothering to recast the analyses in this manner given how similar the IRT and CFA results are likely to be? My intuition tells me that at such large samples the WLSMV estimates originating from Mplus would be unlikely to differ markedly from those produced by an IRT model. What do you think?

As always, references and any additional comments (including any thoughts you have on the overall utility of what can be learned from fitting CFA models to tetrachoric correlation matrices in scale validation studies) are most welcome.

I am disappointed that there are still some journal reviewers who do not understand the relationship between factor analysis of categorical outcomes and IRT - that it's all the same. It's been a long time now since articles like

were published and long before then it was clear that this is all the same.

Perhaps the early focus on correlation matrices in factor analysis throws people off. But that should be seen merely as a matter of estimator choice, not model choice. Tetrachoric correlations belong with (weighted) least-squares estimation using limited information from first- and second-order moments, whereas with ML you work with the raw data (full information from all moments). The model is the same, however - if you assume normal factors and probit regressions, you fulfill the assumptions of underlying normality for the continuous latent response variables that tetrachorics rely on. This is IRT's 2-parameter normal ogive model. Going to the 2-parameter logistic is a trivial model variation. Both the IRT and the factor analysis traditions now work with multiple factors, although I haven't seen explicit use of second-order factors in IRT - mostly because IRT uses ML almost exclusively and the necessary numerical integration is heavy in situations where second-order factors are used, namely with many first-order factors. Programs like Bock and Muraki's TESTFACT are limited, I think, to 5 dimensions. For the same model, least-squares and ML estimation typically give very similar results, as the 1986 JEBS article by Mislevy already showed.

showing that both least-squares and ML techniques are available for both probit and logit. ML can in principle be used for the same complex models as least-squares, but is again limited by computational demands with many first-order factors.

Note also that Mplus can do IRT modeling in mixture (latent class), multilevel, and multilevel mixture situations.

I was interested in looking at comorbidity between two disorders within an IRT framework. However, I am running into a problem in specifying the model. I want the variable coded as 0 = no diagnosis; 1 = diagnosis A; 2 = diagnosis B; and 3 = both diagnoses A and B. If I treat the variable as ordered categorical, the model works by treating the 4-level variable as a graded response. However, there is no reason to think that diagnosis B is more severe than diagnosis A, which is what the graded response model implies.

u1-u4 reflect diagnostic status for disorder A and u5-u8 reflect diagnostic status for disorder B. I am not interested in growth per se, but I am interested in examining some aspects of measurement invariance. Is it appropriate to interpret the threshold parameters in the same way as when there is only one factor?

Sample size depends on many things including the model, reliability of the data, scale of the dependent variables, etc. To know for sure how many observations you would need, you could do a Monte Carlo simulation study.

I am fitting an MIMIC model with binary indicators loaded on a single factor, and several DIF effects are detected. I know that the parameters in MIMIC model can be converted into the parameters for a 2-PL IRT model (refer to Dr. Bengt Muthen's response on May 29, 2006), but I do not know how to get them in Mplus.

I have 3 questions: 1. What are the corresponding commands in Mplus in order to get these converted parameters and their standard errors under this situation (single factor, binary indicators, ML estimation)?

2. Does it make any difference if the estimator is not ML, e.g., WLSMV?

3. How about an MIMIC model with multiple factors, binary indicators and several DIF effects?

I apologize if this is a redundant question, given previous posts, but I am still relatively new to both the IRT literature and Mplus. I am fitting IRT models to binary data. I have no problem fitting the Rasch and 2PL models unidimensionally. However, I believe I need to fit a multidimensional IRT model because I think two factors underlie my dataset. Am I correct in thinking that I can use the same basic code for a multidimensional IRT model as for a unidimensional one? In doing so, I have merely added the second factor to the syntax. The model runs; however, I only get thresholds, rather than difficulties and discriminations.

Am I fitting the model correctly? If so, is there a way to get discriminations and difficulties out of my output?

I apologize for this Stat 101 question -- I have looked at many references referred to in this (and other related) discussions, but I have not found the explicit answer:

How exactly do you convert the probit lambda to a logit lambda, and a probit threshold to a logit threshold? I assume it's not a simple multiplication by the 1.7 conversion factor?

I had attempted to figure this out on my own by running the same data two ways: with link=probit for one run and link=logit for another, both using MLR. I was hoping to see if simple multiplication seemed to work (I had set the threshold@.71 in the link=probit run, and set the threshold to 1.2 in the link=logit run). But I was further confused when, in both runs, the first threshold (item difficulty) in the IRT parameterization section was shown as -1 in the output. I expected it would have been the same as the link=logit value of 1.2.

Sorry, please answer the first question, but let me fix my second question:

My second question should have said:

I attempted to figure some of this out on my own from running the same data and model (single factor model with 60 binary items) two ways: with the link=Probit for one run and link=Logit for another run, both using MLR.

What I assumed from Bengt's response of Dec 6, 1999, was that the conversions (mentioned Dec 2, 1999) from FA to IRT parameters are for converting the link=Probit FA parameters to probit IRT parameters. These were: (i) IRT discrimination/slope: a = lambda/sqrt(1 - lambda**2), and (ii) IRT difficulty: b = threshold/lambda.

But, I can't seem to convert the Mplus FA output to the Mplus IRT output regardless of the link I use.

So, Question 2: How can I convert my Mplus link=Probit FA thresholds and lambdas to the IRT discrimination and difficulty estimates Mplus outputs? I have: lambda1 set to .71, threshold1 set to -.71; Mplus IRT discrimination (a1) output = .71; Mplus IRT difficulty (b1) output = -1.0. (The IRT output says the parameterization is Probit.)
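One possible reconciliation of the two conversions, sketched from the numbers quoted in the question (my reading of the thread, not official Mplus documentation): the lambda/sqrt(1 - lambda^2) formula appears to apply to standardized loadings, while the ML probit output here behaves as unstandardized:

```python
import math

def irt_from_unstandardized(lam, tau):
    """Unstandardized probit loading (residual variance fixed at 1,
    factor variance 1): a = lambda, b = tau/lambda (my reading)."""
    return lam, tau / lam

def irt_from_standardized(lam_std, tau_std):
    """Standardized loading: divide out the residual SD sqrt(1 - lambda^2)
    to get the discrimination (the formula quoted earlier in the thread)."""
    resid_sd = math.sqrt(1 - lam_std**2)
    return lam_std / resid_sd, tau_std / lam_std

# The quoted run (lambda1 = .71, threshold1 = -.71, Mplus IRT output
# a1 = .71, b1 = -1.0) matches the unstandardized rule:
print(irt_from_unstandardized(0.71, -0.71))
```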

In deciding whether observed item scores are continuous or ordered categorical, would it be appropriate to run two CFA models (one specifying continuous indicators and the other specifying categorical indicators with 4 response levels) and then compare the BIC values?

I am conducting an ordinal CFA with the wlsmv estimator. How do I obtain output of parameter estimates in IRT metric that is now available with v. 4.2 of Mplus? I cannot find IRT metric estimates in my output nor can I find the command to request them in the Mplus documentation.

I am currently running a graded response model with 836 participants and 80 items. There are 5 dimensions, each measured on a 5-point Likert scale. The model is running on a server with substantial memory (16GB) and disk space (60GB). I'm using 10 integration points, and once the model began I received the message "THIS MODEL REQUIRES A LARGE AMOUNT OF MEMORY AND DISK SPACE. IT MAY NEED A SUBSTANTIAL AMOUNT OF TIME TO COMPLETE...". The MS-DOS window indicates that the total number of integration points is 100,000. It has now been running for close to 30 hours, and not having run these models multiple times, I wasn't sure if the model is indeed running or if it's 'stuck' (as can happen in LISREL and SPSS). Any suggestions would be appreciated! Thanks!

It's probably still running but this is not a realistic analysis with so many integration points. You can change the integration to Monte Carlo integration (INTEGRATION=MONTECARLO;). Alternatively, you can use the weighted least squares estimator, WLSMV.
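The 100,000 figure follows from the quadrature grid growing exponentially with the number of factors:

```python
# With product quadrature, the total grid size is the number of points per
# dimension raised to the number of factors.
points_per_dim = 10
n_factors = 5
total_points = points_per_dim ** n_factors
print(total_points)  # -> 100000
# Monte Carlo integration instead uses a fixed number of random draws,
# which is why it scales to models with many factors.
```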

Folks- I'm comparing Parscale and Mplus (v3.1) 1PL and 2PL models with ML estimation. I obtain the same difficulty and (for the 2PL) discrimination parameter estimates across the two programs (after using the conversion equations in webnote4).

However, I find that the distribution of the theta (factor) scores differs across the two programs. Parscale provides approximately normal theta values, while the factor scores in Mplus have a standard deviation of .91 in the 1PL and .93 in the 2PL. The latent scores themselves are correlated .99 across the two programs and are centered at 0, but they are distributed differently. Is there any reason you can think of why Mplus provides scores that do not have a standard deviation of 1? Or am I missing something?

One reason might be that there are two ways of estimating these factor scores - EAP and MAP. Mplus uses EAP; perhaps Parscale uses MAP. Also, the variances of estimated factor scores do not in general agree with the maximum likelihood estimates of the factor variances, due to shrinkage. A third reason could be that there was a problem in Version 3.1. ML for categorical outcomes was introduced in Version 3 and there have been many changes since then. I can't think of any such problem offhand, but there may have been one.
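The shrinkage point can be illustrated by simulation; this sketch uses a grid-based EAP with made-up 2PL items (not the Mplus or Parscale implementation):

```python
import math, random

random.seed(1)

def p_2pl(theta, a, b):
    return 1 / (1 + math.exp(-a * (theta - b)))

def eap(pattern, items, n_grid=121):
    """Grid-based EAP under a standard normal prior."""
    grid = [-4 + 8 * i / (n_grid - 1) for i in range(n_grid)]
    num = den = 0.0
    for t in grid:
        w = math.exp(-0.5 * t * t)
        for u, (a, b) in zip(pattern, items):
            p = p_2pl(t, a, b)
            w *= p if u else 1 - p
        num += t * w
        den += w
    return num / den

# Simulate 2000 examinees on 11 made-up items; the EAP scores are shrunken
# toward the prior mean, so their SD falls below the true factor SD of 1.
items = [(1.2, -2 + 0.4 * k) for k in range(11)]
scores = []
for _ in range(2000):
    theta = random.gauss(0, 1)
    pattern = [1 if random.random() < p_2pl(theta, a, b) else 0 for a, b in items]
    scores.append(eap(pattern, items))
mean = sum(scores) / len(scores)
sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
print(round(sd, 2))  # noticeably below 1 because of shrinkage
```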

My question is, what is the parallel to getting the IRT standard error of measurement for the scale scores? Can this be output per each subject's estimated Factor Score, or per Factor Score value (if using ML/MLR or WLSMV)?

The factor scores you get with categorical items and continuous factors when using ML estimation in Mplus are the "theta-hat" scores you obtain in IRT (using the Bayesian Expected A Posteriori approach). They will be on a z-score scale if you have the factor variance fixed at 1, freeing all the loadings (the first is otherwise fixed by default and the variance free).

IRT standard errors of measurement are typically expressed through their inverse, namely the information curves for items and for sums of items. You can request information curves in the Mplus PLOT command. See also the web site description of IRT in Mplus.

After a full day of reading MPlus discussion boards, web notes, and a few other articles about how IRT operates in MPlus, I wanted to make sure that my understanding was clear on a few points:

1. If I am interested in running an IRT (2PL) model on a 16-item unidimensional measure with 5 response categories per item (i.e., items are ordered categorical), while testing for DIF across gender, do I start by running a CFA using the 2-step procedure outlined on p. 399 of the Mplus User's Guide for multi-group invariance testing with categorical outcomes (using WLSMV and the Delta parameterization)? Correct so far?

2. Am I correct in my understanding that single-df chi-square difference testing in WLSMV (for individual parameters such as item 1's factor loading or first threshold) will help me determine statistical DIF based on gender (assuming sufficient power)?

3. Does Mplus have other statistical tests for DIF? I have read about CFI change tests in the invariance testing literature, but I was not sure if these had made it into Mplus.

4. To discuss difficulty in IRT terms, the thresholds can be converted to traditional IRT difficulties (i.e., b) using the simple formula b = threshold/factor loading. Then if I wish to convert from a probit scale to a logit scale (commonly used in PARSCALE and other IRT programs) I need to multiply by 1.7 (or more precisely 1.702). Correct?

5. Can I calculate difficulties for each threshold using this formula? Since each item has 4 thresholds, I plan to calculate a difficulty for each threshold (as the number of responses per category allows). In other words, does the basic formula (b = threshold/factor loading), which seemed to be discussed only for dichotomous items in everything I could find, generalize to polytomous items?

6. To convert the loadings to IRT slopes (i.e., a; item discriminations), do I use the basic formula a = factor loading/sqrt(1 - factor loading^2), and follow the same multiplication procedure (by 1.7) if I need to convert probit to logit? Correct?

1. Yes, or to test only item difficulty DIF, you could use gender as a covariate in a single-group analysis, looking for direct effects (we teach that at Hopkins in our March course). See, e.g., the Muthen 1989 Psychometrika article (on my UCLA web site).

2. Yes.

3. Just multiple-group tests or test of direct covariate effects.

4. Right. Or use ML with logit link right away.

5. Yes, I think so.

6. Yes, or let Mplus do the conversion (given in the output for single-factor models)

These matters will be discussed during day 2 of our upcoming Hopkins course in March.
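For what it's worth, the conversions discussed in points 4-6 above can be sketched numerically. The loading and threshold values below are hypothetical, and the formulas assume a standardized loading from the WLSMV delta parameterization with the factor variance fixed at 1:

```python
import math

D = 1.7  # conventional constant linking the probit and logit metrics

def difficulty(threshold, loading):
    """b = threshold / loading (applied per threshold for polytomous items)."""
    return threshold / loading

def discrimination_probit(loading):
    """a = loading / sqrt(1 - loading^2), standardized loading (delta param.)."""
    return loading / math.sqrt(1.0 - loading ** 2)

def discrimination_logit(loading):
    """Approximate logit-metric slope via the 1.7 constant."""
    return D * discrimination_probit(loading)

# hypothetical estimates: loading 0.6, one threshold of 0.3
print(difficulty(0.3, 0.6))
print(discrimination_probit(0.6))
print(discrimination_logit(0.6))
```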

With polytomous items it seems somewhat arbitrary to me how many possible values each item can take on before we consider the assumption that the items are continuous to be reasonable. That is, if we let subjects use a 100-point response scale for each item, probably few if any would object to analyzing the items as continuous, but there seems to be little principled reason for a priori considering a 100-point response scale to be continuous and a 5-point response scale to be categorical. Thus, when I have 5-point response scales I usually analyze the data both ways and often find that indices of fit such as CFI look a lot better in the IRT approach. My question is whether there are any valid tests of whether the apparent increment in model fit resulting from treating the items categorically is statistically significant?

I think there have been some studies that suggest with five or more categories and no floor or ceiling effects, treating a categorical variable as continuous does not make much of a difference. If the categorical variable has floor or ceiling effects, the categorical methodology can handle that better. You could do a Monte Carlo simulation where you generate categorical data that look like your data and then analyze them as continuous variables in one analysis and categorical variables in another analysis and see if one way is superior in reproducing the population values.
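A rough, standard-library-only sketch of the kind of simulation described above - generate 5-category items by cutting a continuous latent response, then compare correlations with and without a floor effect (the loading and cutpoints are hypothetical):

```python
import math
import random

random.seed(1)
n = 50_000

def cut5(y, shift=0.0):
    """Categorize into 5 ordered categories; a positive shift induces a floor."""
    cuts = [c + shift for c in (-1.5, -0.5, 0.5, 1.5)]
    return sum(y > c for c in cuts)

def corr(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sx = math.sqrt(sum((a - mx) ** 2 for a in xs))
    sy = math.sqrt(sum((b - my) ** 2 for b in ys))
    return sxy / (sx * sy)

lam = 0.7                      # hypothetical loading; true item correlation = 0.49
res = math.sqrt(1 - lam ** 2)
y1, y2 = [], []
for _ in range(n):
    f = random.gauss(0, 1)
    y1.append(lam * f + random.gauss(0, res))
    y2.append(lam * f + random.gauss(0, res))

r_true = corr(y1, y2)
r_sym = corr([cut5(a) for a in y1], [cut5(b) for b in y2])
r_floor = corr([cut5(a, 1.5) for a in y1], [cut5(b, 1.5) for b in y2])
# attenuation is mild with symmetric cutpoints, noticeably worse with a floor
print(round(r_true, 3), round(r_sym, 3), round(r_floor, 3))
```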

Thanks Linda, we might just do a simulation study of this (we already have one planned in which we test whether treating a categorical variable as continuous makes much difference when estimating omega_hierarchical), and I am sure that would be helpful. Apart from a simulation study, though, I am wondering if it would be legitimate to conduct a test of the difference in fit for an individual data set. I don't know enough about IRT and categorical data analysis to know the answer to this question. My guess is that there are no legitimate tests for this purpose. When I run the same model with the items treated as either categorical or continuous, the chi-square values and dfs seem to be on different orders of magnitude (e.g., in one model in which the items are treated as categorical the model df = 191, whereas in the exact same model with the items treated as continuous the df = 2137).

I would need to see the two outputs you are comparing to comment but you are comparing different estimators and models with a different number of parameters. In addition, if you are using WLSMV, the degrees of freedom do not have the same meaning as with ML for example. The continuous model will contain linear regression coefficients while WLSMV will provide probit regressions. With five-category items that have no floor or ceiling effects, I would expect similar p-values for the chi-square test of model fit and also similar ratios of parameter estimates to standard errors (column 3 of the output). If these are not similar, then I would use the categorical methodology.

Right, and I am assuming, given the different estimators and the different meanings of the dfs, that there is not any test that could be validly used to compare the difference in fit. (In the particular example I am referring to, the p-values are indeed the same and the ratios of parameter estimates are similar for the most part, though in some cases the values are as different as, say, 6.9 vs 9.8; the RMSEA estimates are highly similar - .053 vs .042 - but the CFIs seem meaningfully different to me - .937 for the categorical model vs .851 for the continuous model.) P.S. Thanks for the very speedy replies!

Does Mplus provide callable built-in functions for the cumulative distribution function of the standard normal, erf, and integration, where we can input the arguments? I am doing my master's project and my supervisor recommended that I use Mplus. However, both of us are extremely new to Mplus and we are still learning from scratch.

I am not sure how to interpret the results as to which parameterization the first block of the output uses (the one directly under MODEL RESULTS), i.e., what is the relationship between the estimates under Item Discriminations and those immediately under MODEL RESULTS?

The information under Model Results shows the results from the estimation of the model. The information under IRT PARAMETERIZATION is a translation of the results into the IRT parameters of discrimination and difficulty. See IRT under Special Mplus Topics on the website for details about the translation.

For polytomous item models, where IRT parameter conversions are not provided, is there some difficulty or issue to be aware of? The discussion above refers to manual computation, but I assume that if there were no issues, Mplus would just do it.

Do I convert the multiple thresholds to multiple difficulty parameters using the same conversion as in the dichotomous case?

Dear Bengt, one thing is puzzling me. When obtaining test information curves for a simple 1-factor model, the results depend on the link used with ML estimation. With the logit link, values are just over 3 times larger than for the probit link. The shapes of the TIC are pretty much the same. Is this to do with the scaling constant 1.7? But when squared it gives 2.89, not 3. Which is the "correct" information? It is important for estimating standard errors. In my model the factor variance is fixed to 1 and the factor loadings are free.

I have been doing some IRT modeling for the purpose of a large-scale assessment development, and have found some interesting things I was hoping to get clarification on. I generated a 2PL model in MPlus using maximum likelihood with the logit link, then a 2PL model in BILOG, and a 2P normal ogive model in BILOG. When comparing the three methods, the MPlus results are nearly identical to the normal ogive results in BILOG, but are nowhere near the logistic results in BILOG. Though the correlations are of an expected magnitude (1.0), the differences between them vary greatly. Further, the discriminations are quite different (MPlus = 1.24, BILOG-ogive = 1.27, BILOG-log = 2.02). BILOG uses marginal maximum likelihood, but why do the 2PL results from MPlus match the BILOG normal ogive model and not the BILOG 2PL model? Thanks!

If you are using maximum likelihood and the logit link, you should get the same results. This is not the default, so you would have to specify it in the ANALYSIS command. The difference may be due to BILOG using the constant 1.7 in its computations while Mplus does not; Mplus gives the results in the IRT metric as well, using the 1.7 constant. If none of this helps, send the files and your license number to support@statmodel.com.

I am using MPLUS 5 to run a 2-parameter IRT. I used the graph feature in the program to plot the overall ICC curves as well as the group (gender) differences in the curves. Can the size of the plot lines and symbols be modified? If not, is this something that is being worked on? It would certainly improve the look of the plots that are generated.

Also, is there a way to import the labels and titles saved in a previous plot into a new graph? These are some things that would improve the user-friendliness of the software.

The size of the lines cannot be changed. Symbols can be changed using the Line Series option under the Graph menu. At the present time, labels and titles cannot be saved. This is on our list of things to add.

I am a novice to IRT but am trying to examine item endorsement invariance across gender on a test with 10 dichotomously scored items (y vs. n). I am currently using Mplus 5. First I ran a CFA to confirm the unidimensionality of my scale, and then I ran a second model that included gender as a covariate (see code used below). Is significant DIF indicated simply by the significance of my covariate on the item as my indicator?

A significant direct effect of a covariate on an item represents DIF. In your example it would be, for example:

j ON gender;

You may find the slides that discuss measurement invariance and population heterogeneity from our Topic 1 course handout helpful. Also, the Topic 2 course handout contains information specific to categorical outcomes. The video for these topics is also available on the website.

My follow-up question relates to the issue of having a covariate that is a 5-level nominal variable (e.g., 5-level age group: 1=18-24, 2=25-34, 3=35-44, 4=45-64, 5=65+, with group 2 as my referent category): How would I plot the curves to look at DIF across age groups in comparison to the referent category? I know I would create 4 dummy variables with the referent category left out. However, when I try to plot the relationship I cannot figure out how to get the plot for the referent category. Can you make any suggestions?

The Item Discrimination and Item Difficulty information is no longer included in the MPLUS output. MPLUS gives me the OR indicating significant or non-significant DIF for the age groups relative to the referent category (age2, which is left out). Is there a way for me to obtain the Item Discrimination and Item Difficulty information for each level of the covariate?

Hi Linda/Bengt; As Linda suggested, I downloaded the slides that discuss measurement invariance and population heterogeneity from your Topic 1 and 2 course handout. On slide 161, in discussing the interpretation of the effects it was concluded that shoplift was not invariant. Would this also be true if the direct effect of gender on shoplift was statistically significant and positive (instead of negative) but all other effects remained the same as in the slide? That is, as expected, for a given factor value, males had a higher probability of shoplifting than females. I am assuming that it would be but this is not clear from what is written in the slides.

Another question relates to the calculation of the item discrimination and item difficulty for different levels of my covariate (Mell Mckitty posted on Thursday, October 16, 2008 - 1:32 pm): would I use the model estimates or the standardized estimates? Which are the alpha and psi values in the MPLUS output?

Another related question: How do you determine if DIF is uniform versus non-uniform? I am assuming that the inclusion of an interaction term would work. That is, a significant interaction term would indicate non-uniform DIF. Is this correct?

A colleague and I are attempting to construct and interpret a polytomous item response model.

I want to make sure I am obtaining the slope and category thresholds that are commonly reported and are consistent with what you would obtain through running a graded response model in Multilog.

Following example 5.5 in the manual, I begin by designating my estimator as Robust Maximum Likelihood. I continue by specifying the variance of the latent construct to 1, and make sure I am using the logit link.

To obtain the thresholds, I take the logit thresholds reported in MPLUS and divide by the standardized factor loadings?

To obtain the slopes, I take the Standardized Factor Loadings and divide by the square root of (1-factor loading^2).

Once an interaction term is in the model, the plots of the ICC are no longer available in MPLUS. If a statistically significant interaction term is observed, thus indicating non-uniform DIF, how would one go about obtaining this plot in MPLUS?

Thanks for your quick reply. Is the use of the interaction term as indicated in my model (Mell Mckitty posted on Wednesday, October 22, 2008 - 8:16 am ) a sufficient way of assessing for non-uniform DIF?

Bengt/Linda, I ran a series of MIMIC models: 1. without the covariate; 2. with the binary covariate + an interaction term, to rule out non-uniform DIF if the interaction term was not significant; 3. a model with the binary covariate. I tested the formulas from the technical notes by replicating the item discriminations (a) and item difficulties (b) printed in the MPLUS output for model 1 - there were minor differences between the calculated b's and those from MPLUS. I am now trying to use the formulas to calculate a and b for the different levels of my covariate (Model 3) for the item with DIF: How would the formulas be modified to do this? I think that for the item of interest I would take the following information from the output of Model 3: lambda = unstandardized estimate; tau = unstandardized threshold; alpha = estimated means for the latent variable from TECH4.

I am not clear on the value of psi: 1. is psi = the residual variance for the latent variable from the model output or the covariance estimate for the latent variable in the TECH4 output?

2. how does the estimate of the effect of the covariate fit into this formula?

3. If uniform DIF is present I believe that a would be constant across different levels of the covariate but that b would vary so, the estimate of the effect of the covariate should affect tau. Is this correct? and how?

Is there any way to specify an interaction term in a CFA with covariates when the WLSMV estimator is used? I tried but keep getting an error message. If not, is it sufficient to use the multigroup method (i.e., grouping is (1=male, 2=female))?

I want to compare the 2 sets of difficulty parameters per item (e.g., the item_A parameters when flfn_A=1 and flfn_A=0.) I was somewhat unclear as to how to apply the Muthen et al. (1991, p. 10) formula to compute these, or whether computation would differ when using ML estimation. Would I simply (a) use (item threshold - regression weight)/item loading for predictor=1 and (b) use item threshold/item loading for predictor=0?

Secondly, I wonder whether effects of the item predictors must be modeled on both F as well as the items themselves. None of the predictor-factor loadings were significant, nor would I hypothesize that they would be. Rather, I would expect that item predictors only affect ability levels indirectly, through adjusting the likelihoods of correct responses. Could I constrain the loadings for F ON flfn_A-flfn_J to be zero or leave them out of the model completely, or must these paths be included for the model to be estimated/interpreted correctly?

The slope for the item regressed on the binary covariate simply shifts the threshold for that item (the slope is of opposite sign of the threshold), so all formulas we have in our IRT tech doc follow once you have computed the 2 threshold alternatives.

I would let the covariates predict f as well. It would seem that, considering them all together, there might be an effect on f, although each covariate has a small effect.
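Following the reply about the covariate slope shifting the item's threshold, the two threshold alternatives and the resulting difficulties can be computed as below (all values are hypothetical; kappa stands for the estimated slope of the item on the binary covariate):

```python
# hypothetical Mplus estimates for an item with uniform DIF on a binary covariate x
tau = 1.2     # item threshold
lam = 0.8     # item loading
kappa = 0.5   # direct effect: u ON x (shifts the threshold, opposite sign)

b_x0 = tau / lam            # difficulty when x = 0
b_x1 = (tau - kappa) / lam  # difficulty when x = 1, using threshold tau - kappa
print(round(b_x0, 3), round(b_x1, 3))  # -> 1.5 0.875
```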

So D is taken into account in the calculation of a. Also, psi refers to the factor variance, and angst@1 indicates that psi is set at 1. Therefore, equation 18 from the IRT technical notes, taking into account the interaction term, becomes,

a=((lambda +interaction*x)*sqrt(psi))/D, which is the same as 1 above.

Equation 19 with alpha=0 and psi = 1 and taking into account the direct effects of the covariate becomes:

My next question relates to the calculation of the probability, which needs to take into account the indirect effects as well. So, how would the indirect effect in my model be incorporated into equation 17 (from the IRT Technical Notes) to calculate the probability P(Ui = 1|f)?

I know that the indirect effect (i.e., angst<-x) affects the mean of the latent trait (i.e., theta). As such, I am assuming that the indirect effect would be added to theta in equation 17. Is this reasoning correct? I have done this and the probability and related plot of the ICC curves for my indicated item for the different levels of my covariate seems correct but I would like some confirmation.

Sorry, I think I am getting a number of different questions all mixed up:

1. Correct, I wanted to know the equation to calculate angst on x and to plot the ICC curves.

2. I wanted to figure out how to identify theta from my output with mlr estimation with interaction included. I just went over your notes again and realize that theta is the residual variance, which gets printed out when standardized is indicated in the output. However, STANDARDIZED (STD, STDY, STDYX) options are not available for TYPE=RANDOM, and TYPE=RANDOM is necessary if an interaction term is specified in the model. Is there any other way to get the residual variance for my indicated item once an interaction term is specified in the model?

3. I am assuming that a combination of the direct, indirect and interaction effects must be taken into account in the calculation of P(Ui = 1|f). So, how does the indirect effect fit into equation 17?

Regarding point 2., the residual variance parameter theta is not present in your maximum-likelihood estimation using the logistic form. That's why you don't see it in (1) and (2) of our IRT document that we are discussing, nor in (18) and (19).

I'm sorry, I can't go further than I already have on points 1 and 3 because it turns into statistical consulting which we don't have time for. Perhaps you want to discuss with your local statistical consultation center.

I need to calculate the first and second derivatives of the log-likelihood function with respect to the factor scores for an IRT model similar to ex5.5 so as to be able to calculate the maximum likelihood IRT scores and the information function. In conventional IRT, the first and second derivatives can be obtained by adding equations 1 and 2 respectively across all the items:

These calculations, however, do not yield the anticipated results when using the IRT parameters estimated in Mplus (version 5.1). Could you please tell me how to obtain these derivatives based on the parameters estimated in Mplus (e.g., as in ex5.5)? Thank you.

Are you considering the "maximum-likelihood" estimator of the latent factor score "theta", or are you considering the "Expected a posteriori" estimator? With the ML estimator for the parameter estimates Mplus uses the latter estimator, which implies that a normal "prior" is used in the calculations. See IRT books.

I have a question about the definition of “linear” CFA vs “nonlinear” CFA.

According to MPLUS manual (Example 5.7 non-linear CFA), it seems that the term “nonlinearity” is defined in terms of how factors are specified (e.g., interaction, quadratic).

I think the 2PL (Mplus example 5.5) IRT model belongs to nonlinear one-factor CFA. Am I correct? I tried to run a 2PL IRT model with 60 binary items but it took a really long time to get results. I changed the estimator from MLR to WLSMV, which gave results relatively quickly. If I use WLSMV instead of MLR, can I still say that I am estimating an IRT model? I heard that if I am using estimators other than MLR, I am estimating a 2-parameter normal ogive model (not a logistic model).

With continuous outcomes a model is non-linear if it has non-linear functions of factors. With categorical outcomes, the model is always non-linear because the conditional expectation function (the item characteristic curve) is non linear.

2PL IRT with 60 binary items should go very quickly using ML because you use only unidimensional integration over the single factor.

WLSMV uses probit which in IRT language is the "normal ogive". This is still an IRT model.
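On the probit/logit point, the closeness of the two curves under the 1.7 convention is easy to verify numerically (a standard result, sketched here with hypothetical parameter values rather than taken from Mplus output):

```python
import math

def icc_logit(theta, a, b, D=1.7):
    """2PL item characteristic curve in the IRT metric, with the 1.7 constant."""
    return 1.0 / (1.0 + math.exp(-D * a * (theta - b)))

def icc_probit(theta, a, b):
    """Normal-ogive ICC; standard normal CDF written via erf."""
    return 0.5 * (1.0 + math.erf(a * (theta - b) / math.sqrt(2.0)))

# with the 1.7 constant, the two curves differ by no more than about 0.01
diffs = [abs(icc_logit(t, 1.0, 0.0) - icc_probit(t, 1.0, 0.0))
         for t in (x / 10.0 for x in range(-40, 41))]
print(max(diffs))
```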

which seems to be an analog of using IRT and multiple group analysis in order to test for differences between the a and b parameters (in a 2-parameter logistic version of the model) across groups. Could the DCF be computed in this way? Similarly, when they plot Test Response Curves (TRC), which are supposed to indicate DCF differences, they plotted Expected Raw Scores by the severity factor. Is it clear what these Expected Raw Scores should be? Thanks much in advance,

To follow up on a post above, I successfully transformed the thresholds of my polytomous IRT model reported by MPLUS after employing the MLR estimator into those reported by MULTILOG, using (MPLUS threshold/factor loading) = MULTILOG threshold.

However, I'm off when I try to transform the standard errors. Is there something I'm missing? It looks like it should be a simple transformation. The z-scores reported are close, but not spot on.

If you use MODEL CONSTRAINT to describe the transformation, you get the right delta-method SEs. Note that such SEs involve not only the SEs for the threshold and factor loading, but also their covariance.
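A sketch of the delta-method calculation described above, for b = tau/lambda (the input values are hypothetical; in practice MODEL CONSTRAINT does this for you):

```python
import math

def delta_se_b(tau, lam, var_tau, var_lam, cov_tau_lam):
    """Delta-method SE for b = tau / lambda.

    Uses the gradient (1/lambda, -tau/lambda^2) and the full 2x2
    covariance of (tau, lambda), including their covariance term.
    """
    g_tau = 1.0 / lam
    g_lam = -tau / lam ** 2
    var_b = (g_tau ** 2 * var_tau
             + g_lam ** 2 * var_lam
             + 2.0 * g_tau * g_lam * cov_tau_lam)
    return math.sqrt(var_b)

# hypothetical estimates; note how much the covariance term matters
print(delta_se_b(1.0, 0.8, 0.04, 0.01, 0.01))
print(delta_se_b(1.0, 0.8, 0.04, 0.01, 0.00))
```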

I've read through the postings on conversion of factor analytic parameters to IRT parameters. In my case, I've run a multiple group CFA model with categorical observed variables and regressing covariates on the single latent factor using WLSMV with delta parameterization. The model uses the default constraint alpha=0, and loadings and thresholds for variables showing evidence of DIF in previous analyses are free to vary across groups. What I need clarification on is what values of alpha and psi I should use to convert the loadings and thresholds to IRT discrimination and difficulty parameters using equations 19 and 22 in the IRT technical appendix. Should I use the Tech 4 estimates or should I use alpha=0 and the residual variance estimates in the output?

Either way, I would get different IRT parameters for variables that were constrained to be equal across groups (i.e., noninvariant). Is this because I have regressed covariates on the latent factor, and would I just explain this when I present the results?

With multiple-group CFA (or IRT), the default is alpha=0 in the first group and free in the other groups (so not fixed at zero in all groups).

To get the standard IRT metric you would use the TECH4 means and variances for the factor.

But if you do that, then your IRT curves will be different even though the thresholds and loadings are equal - that's a function of the standard IRT metric using a different standardization (different alpha and psi) in the different groups. So I would just use the alpha, psi standardization in, say, the first group and not the other groups - you can then see invariance in the item curves.

Thanks for the clarifications. When I ask for IRT curves using the Mplus graph option after running the model, are those curves calculated based on the approach you suggested where just the alpha, psi standardization from the first group is used to calculate the IRT parameters? When I use the graph option, I do get IRT curves that are the same across groups for the invariant items, but differ for the noninvariant items.

I am fitting a 3-factor CFA model with ordered categorical items using the WLSMV estimator. I let all factor loadings and item thresholds be free in the model and fixed factor variances and means to 1 and 0, respectively, for identification purposes. I want to fit information curves (IIC) to these items given the 3-factor structure.

1. Is that OK to do (given IRT analyses typically fit IICs for unidimensional models)?

2. I know it is feasible to do in Mplus, but I was wondering if these information curves are correct. That is, do they have the same interpretation as the IICs fitted in a one-factor IRT graded response model?

3. And, would you tell me if there is documentation regarding this application of IICs (e.g. Mplus technical notes and citations/references)?

Mplus computes information curves also in the multifactorial case. The curve for items loading on a factor draws on the full multivariate information using the second-order derivative with respect to the factor in question. The remaining factors are substituted by their means. See also our IRT tech note:

Answer to Jen of March 10. I was being confusing - the translation to IRT parameter values uses alpha and psi to bring them to the N(0,1) metric used in IRT. The IRT curves that Mplus plots, however, use the Mplus factor parameterization and because what is drawn is the probability given the factor, the factor mean (alpha) and factor variance (psi) does not enter into the curve (only in terms of the location and range of the "x axis"). So, yes, invariant items will show up as invariant even across groups with different alpha and psi.

I am working on a latent growth curve model where the items for assessing the construct (social support) changed after the second wave of data. In particular, new items were added, and binary yes/no response scales were changed to 4-point Likert scales.

Thus the assumption of measurement invariance over time is surely violated. The only glimmer of hope I see is some form of IRT score equating across the different versions of the social support instrument. Probably this could not be done in Mplus, but in another program followed by importing the IRT scores for analysis in the LGC model. Any comments you might have on the reasonableness/feasibility of such a procedure would be greatly appreciated.

If there are at least some items that are repeated in the same format, there would be hope for equating - which could be done in Mplus in a single modeling step (unless data called for a 3PL). Otherwise not, I don't think. Changing from binary to 4-point scales can make a big difference I would imagine.

I am finding that a bifactor model fits my data best in many cases (child externalizing dimensions). But, after reading several postings, I'm not clear whether bifactor loadings and thresholds may be used to derive IRT discrimination and difficulty parameters in the same way they are used in the one-dimensional case - because items load on multiple factors.

Question 2: If I derive factor scores in MPLUS from a bifactor model, what are the resulting factor scores analogous to, in terms of the information provided?

Would the factor scores provide theta estimates based on all factors (averaged out)? Or can I derive a factor score that provides information on the general factor with the specific dimensionality factored out? Is it possible to pull multiple factor scores when using multiple factors?

I am trying to correct a 4x4 correlation matrix of observed variables for attenuation and I think that I can do this quite easily with Mplus.

If I model each of the four observed variables as single reflective indicators of four latent variables (1 indicator per LV), and set the loading to 1 for each, wouldn't the correlation of the latent variables be the corrected correlation of the observed variables?

Linda and Bengt: Hello from an "old" friend! A colleague and I are using MPLUS to do a graded IRT model. We have four items, each with four response categories. The most direct question is whether the results for the discrimination and threshold parameter estimates need to be rescaled, and if so, how? I ask because you do rescale estimates for 1 and 2PL models, taking into account what I assume is the probit/logit distinction in estimation. But I see no such rescaling option for the graded response model. Without rescaling, the results seem literally to be "off the map."

using the Samejima graded response model where D is chosen to make logit and probit close (1.7), a is the discrimination, theta is the "factor", and b are the difficulties.

You go from the Mplus results in factor metric to the IRT metric as follows. The Mplus IRT tech doc on our web site implies that when you run Mplus with the factor standardized to zero mean and unit variance (as is typical in IRT), a comparison of (1) and (2) gives

(3) a_j = lambda_j/D, (4) b_jk = tau_jk/lambda_j

Check if that doesn't get you results in a metric seen in IRT. You can do the translation (3) and (4) in Model Constraint using parameter labeling so a_j and b_jk get estimates and SEs.
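Translations (3) and (4) are easy to do by hand; a minimal sketch with hypothetical estimates (in Mplus itself, MODEL CONSTRAINT would additionally give you SEs, as noted above):

```python
D = 1.7  # the scaling constant in the Samejima parameterization above

def grm_params(lam, taus):
    """Translate factor-metric estimates (factor ~ N(0,1), logit link) into
    IRT discrimination a and difficulties b_k, per (3) and (4)."""
    a = lam / D
    bs = [tau / lam for tau in taus]
    return a, bs

# hypothetical 4-category item: one loading, three thresholds
a, bs = grm_params(2.0, [-1.0, 0.5, 2.0])
print(round(a, 3), bs)  # bs -> [-0.5, 0.25, 1.0]
```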

I would like to ask a follow-up question regarding IRT parameter estimates for bifactor models. I am wondering if the item parameter estimates provided by Mplus are appropriate for multidimensional/bifactor models? That is, can I apply the usual IRT transformations of factor loadings and thresholds estimated with WLSMV, and does MLR produce the correct IRT parameterization on its own?

Also, do the plots of information functions have the same meanings as they do for unidimensional models [SE = 1 / sqrt (info)]?

You can use Mplus for bifactor models with categorical outcomes. Because you have more than one factor, you don't get the IRT translation but you can do it yourself by hand.

The answer is yes to your information function question, although the actual details are more complex. Mplus provides information functions also for multiple factors but the information function for a given factor depends on which factor value for the other factor that you consider. Because of this, Mplus lets you plot the information function for one factor at a value of the other factor that you choose (such as the mean). You can also condition on covariates, so this plotting is quite general.
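For the normal-ogive case, the kind of conditional information described above can be sketched as follows. The item-information formula a^2 * phi(L)^2 / [P(1 - P)] is the standard normal-ogive one; holding the other factor at a chosen value (e.g., its mean) mirrors the reply, but this is my own sketch, not Mplus's exact computation:

```python
import math

def norm_pdf(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def item_info_general(a_gen, a_spec, theta_gen, theta_spec, d):
    """Normal-ogive item information for the general factor, with the
    specific factor held at a chosen value (e.g., its mean of 0)."""
    z = a_gen * theta_gen + a_spec * theta_spec - d
    p = norm_cdf(z)
    return a_gen ** 2 * norm_pdf(z) ** 2 / (p * (1.0 - p))

# at z = 0 this reduces to a_gen^2 * 2/pi; information drops when the
# specific-factor value moves the item away from its steepest region
print(item_info_general(1.0, 0.5, 0.0, 0.0, 0.0))
print(item_info_general(1.0, 0.5, 0.0, 2.0, 0.0))
```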

I would like to ask a question related to fit indices in IRT models. A colleague and I are using Mplus to do a graded response IRT model. Our items have four response categories (Likert scale). We are interested in the absolute fit of the model. Since there are problems with using the ML chi-square values to assess differences in fit between models, we decided to use the WLSMV estimator to assess model fit.

The output we got was quite a surprise (see below). We don't know why our CFI value is lower than the TLI. Should we be concerned with this outcome? Do you have an explanation for it? Thank you for your time, Anna-Mari

These discrepancies can occur. See the Yu dissertation on the website for information about fit statistics and their behavior. This would make me suspicious about my model. Try alternative specifications.

On Wednesday, December 17, 2003 - 9:00 am, Linda stated that the formula

a=loading/sqrt(1-loading**2)

was only valid if there were no cross-loadings. Is that because it is scaling the loading by the variance not explained by the target factor instead of the variance not explained by any factor? If so, could you include that influence with this adjustment:

a=loading/sqrt(1-loading**2-loading2**2)

to scale the parameter by the overall residual variance? If not, then what is the correct formula when there are cross-loadings? Finally, could you point me to an article that discusses this issue?
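For what it's worth, the adjustment proposed in the question matches the standard identity when the factors are uncorrelated: the standardized residual variance is 1 minus the sum of squared loadings, and each discrimination is its loading divided by the square root of that. With correlated factors the residual variance would also involve the factor covariances. A sketch under the uncorrelated-factors assumption:

```python
import math

def discriminations(loadings):
    """Probit discriminations for an item loading on several uncorrelated
    factors (standardized solution): a_k = lambda_k / sqrt(residual variance)."""
    resid = 1.0 - sum(l * l for l in loadings)
    return [l / math.sqrt(resid) for l in loadings]

# hypothetical item with a target loading of 0.6 and a cross-loading of 0.3
print([round(a, 3) for a in discriminations([0.6, 0.3])])  # -> [0.809, 0.405]
```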

I have estimated a multi-group IRT (4-groups) using the MLR estimator. The means are set to zero in one group and allowed to vary in the others. Variances are constrained to 1. I first tested differences in thresholds for each item (there are 9) individually by comparing model fit. After accounting for all differences in thresholds I next tested differences in item difficulty.

My question regards the significance of thresholds. Two items have non-significant thresholds in one or more groups according to the p-values and confidence intervals, although the discrimination parameters are significant. Is something wrong here? If so, how do I troubleshoot? If not, how does one interpret a non-significant threshold, if at all?

I've just been doing this very same thing so have looked back at my output.

The p-values for the thresholds indicate whether they are significantly different from zero. Assuming that you have binary data and are not modelling a guessing parameter, this just tells you that for this item, within that group, a positive or negative endorsement of the item is equally likely at the centre of your trait - you could verify that by plotting the ICC. In other words, certainly not something to worry about.

In your model I don't think the tests for thresholds are particularly informative, although the parameter SEs might be useful to get a better handle on how these parameters differ across groups following your omnibus test.

I am a bit unclear on what you are doing here. When you talk about means and variance I assume this is for the factor. If you have multiple groups you want to test for measurement invariance. So assuming this is what you do, the factor variance should not be fixed at one in all groups.

And then you say

"After accounting for all differences in thresholds I next tested differences in item difficulty."

which confuses me because the item difficulties are functions of the thresholds.

What is the difference between ML and MLR for IRT models (i.e., assuming categorical data and full information estimation)? For continuous data, I understand that MLR is supposed to help adjust SEs and test statistics for non-normality, but if normality is not assumed for categorical data to begin with, then what does MLR have to offer over ML? I apologize if this topic is already addressed elsewhere, but I could not find it. Thanks in advance for any direction you can provide!

Hi, I have conducted an IRT analysis with Mplus and saved the factor scores. But these factor scores don't correlate with the observed scores - how is this possible? They are supposed to correlate, right? I used the WLSMV estimator.

Hi, I have another question. I am trying to fit a very large model, but when I do this I get the following warning:

WARNING: THE SAMPLE CORRELATION OF V97 AND V9 IS -0.999 DUE TO ZERO CELLS IN THE BIVARIATE TABLE

I get this warning for many more variables, but not all.

I checked the correlations, but these are not correlated this high, and not one correlation is below zero. What does this mean? What is meant by zero cells in the bivariate table?

Zero cells in the bivariate table of two dichotomous variables imply a correlation of plus or minus one. The two variables should not both be used, as one of them contributes no additional information. This can happen with small samples and skewed variables.
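To see why a zero cell forces this, consider a hypothetical 2x2 table (the counts below are made up):

```python
# Hypothetical 2x2 table of counts for binary items u and v
# (rows: u = 0/1; columns: v = 0/1)
table = [[40, 10],
         [0, 50]]  # zero cell: nobody with u = 1 has v = 0

# The zero cell makes the relation deterministic in one direction:
# every respondent with u = 1 also has v = 1, which is what drives
# the estimated tetrachoric correlation to plus or minus one.
p_v1_given_u1 = table[1][1] / sum(table[1])
print(p_v1_given_u1)  # 1.0
```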

For all dichotomously scored items this was in fact true. But for all four partial credit items the threshold parameter estimates differed remarkably. Using Mplus, we got e.g. the threshold parameter estimates I69$1=0.697 and I69$2=2.276. Using ConQuest we got the item difficulty 1.84734 and 0.11344 for the step parameter. This corresponds to the threshold parameters I69$1=1.96078 and I69$2=1.7339; that is, the threshold parameters are unordered! The results for the other partial credit items were similar.

Both programs yield the same response category proportions. Increasing the number of integration nodes doesn't change the results. Inspecting the residual statistics reveals that the differences between the observed and the model-implied response category proportions seem to be small. The item fit statistics do not indicate any kind of misfit, either.

I am fitting a single factor model with both continuous and dichotomous indicators. If I have all dichotomous outcomes, Mplus will output "IRT parameter" estimates and standard errors (the estimates are obtained through the conversion formula, and I assume the standard errors are obtained by the delta method as in MacIntosh and Hashim). My question is: is there a way to get Mplus to output these "IRT parameters" for the dichotomous indicators when I have a mix of dichotomous and continuous indicators?

I know I could use ML and get the IRT parameters directly but I am particularly interested in using WLS.

Example 5.5 in the manual gives the program for a 2PL graded response model. I am specifying CATEGORICAL (ordered polytomous) indicators. If I understand correctly, using the WLSMV (instead of MLR) estimator gives me the same model, but as a 2-parameter normal ogive rather than 2PL.

1. Is this correct?

I have read elsewhere on this board that the loadings that are output for the WLSMV estimator (normal ogive(?) above) (A) do not correspond directly to IRT "a" parameters but (B) require a transformation to correspond to the "a" parameters, and that (C) the transformation is only possible if items only load onto a single factor.

2. Are (A), (B), and (C) correct? If so, and especially if (C) is correct, has anything been worked out anywhere, to your knowledge, that allows for the transformation when items load onto >1 factor? (I am doing a bi-factor model with WLSMV and want factor scores and "a" parameters for the general factor.)

3. Can I get the standard errors for the factor scores when using the WLSMV estimator? They are automatically given in the FSCORES file with the (more computationally intensive) 2PL model, but do not appear when using WLSMV. Is there a way to get them?

For a TWO-PARAMETER LOGISTIC ITEM RESPONSE THEORY (IRT) MODEL (as in example 5.5), could you please confirm whether the standard error of the factor scores (in the SAVEDATA file) is the inverse of the square root of the information function as defined in formula 14 of the MplusIRT2 document(http://www.statmodel.com/download/MplusIRT2.pdf). Thank you very much.

Dear Drs. Muthen, We would like to examine temporal measurement (non-)invariance in an IRT model with “dense” repeated measurement: participants (n = 100) completed the same 6 items (5 ordered categorical response options per item) every day over 28 days. Instead of estimating 28 factors simultaneously, we think about estimating a single factor for all 2800 days, using the TYPE=COMPLEX option to adjust for the non-independence of observations. In this case, would it make sense to introduce “temporal” covariates (e.g., day of assessment as continuous covariate, week 1 versus weeks 2-4, weekend-days versus weekdays) in MIMIC models to examine DIF based on time, or is there something about this model that would not be accounted for by the COMPLEX option? Thanks very much for your support.

It sounds like you can view this as 2-level data, where level 1 is time (28) and level 2 is subject (100), and where you have 6 outcomes. So you could model it via Type=Twolevel or Complex. The two-level approach is similar in structure to doing growth modeling as two-level, where time-varying covariates can be handled as in UG ex9.16.

Thank you for your response. Will this FAQ include information on how I could compute the discrimination and difficulty parameters for such a model myself using the given loadings and thresholds? Or, if possible, would you tell me how that could be done in your next response in this thread?

I am interested in generating 2PL IRT model for dichotomous indicators where the latent trait is not normally distributed. I see how to generate non-normal indicators, but am not sure about the generation of factors with a particular skewness. I wonder if you could point me in the right direction. Thanks very much.

I would try to do that via mixtures. For instance, a 2-class mixture of two normal factors can represent a log-normal factor distribution - see pages 14-16 of the McLachlan-Peel (2000) book on mixtures.

There are probably exact results to be found, but otherwise you can take a trial and error approach to choosing means, variances, and class sizes to get the non-normality that you want.

If I were interested in looking at group DIF or group differences in item endorsement using a 2PL IRT model, what is the difference in Mplus between these two scenarios: (1) a stratified IRT in each group, then plotting ICCs from common items by group in the same figure, vs. (2) fitting an IRT with a covariate (see *example below), then plotting the ICCs by group (using the "name a set of values" command in plots)?

I saw a much more substantial difference in the stratified analysis while nearly identical curves in (2).

In your *Example you show a model that is not identified because you can't have all direct effects of a covariate on the items and also a covariate effect on the factor. Perhaps you mean that this is just a part of the model where there are other items that aren't directly affected by the covariate.

My colleague and I are trying to compare results from a traditional CFA to a multidimensional IRT analysis using the same data. We have polytomously-scored items and we are using MLR, specifying that our data are categorical.

We are also running the multidimensional IRT analysis using Conquest software in order to confirm our IRT results. When we do so, we get item fit indices in the output—the squared standardized residuals and a variance-weighted version for each item. That is, Conquest computes the average of the squared standardized model-based residuals for each item. We are wondering if MPlus would give us something like an item fit index that is comparable to this. We first thought of modification indices, but it is my understanding that you cannot get modification indices when you have categorical indicators. Is this true? Can you think of another index that would constitute a comparable indicator of individual item fit? Could we use standardized residuals from MPlus output in a similar way?

The CFA model for categorical indicators without the guessing parameter is the same model as IRT without the guessing parameter. You should not find differences when both are estimated using the same estimator.

We do not provide the fit index you mention. I don't think you will be helped by modification indices. Try TECH10 where univariate and bivariate fit is shown.

1. Does Mplus directly provide transformed values for my Step 1 analysis? How can I obtain them? Or do you have any other recommendations?

2. If I am to use plausible values before running EFA, CFA, and SEM, does this mean I should obtain plausible values before IRT calibration, or after my Step 1 analysis? Further, if plausible values are to be used, what is the minimum number of data sets that would be acceptable to Mplus practitioners? Five seems to be too many for my case.

It would be also much appreciated if you can recommend works using IRT and SEM in combination.

Yes, I have 117 observed indicators in my whole dataset. But directly putting them all together into a full model seems to make things too complicated.

It seems using composite variables would make things easier. I am wondering whether it is appropriate to weight each observed variable by its IRT discrimination parameter before grouping them.

So my question reduces to one: does Mplus provide a transformed response matrix based on the IRT discrimination parameters? That is, the matrix with each item response vector multiplied by its corresponding IRT discrimination value?

Apologies if this is an overly simple question. I am interested in fitting a 2-level IRT model where item responses (L1) are nested within individuals (L2). I’m stuck on setting up the data in a way that will be correctly analyzed. So, my current thinking is that the data should be assembled such that item responses are in 1 column vector stacked for all examinees. So, for a 3 item test taken by 10 examinees, my data would be 30 rows long. In de Boeck & Wilson (2007), the authors augment the data matrix with an identity matrix to facilitate model specification. Accordingly, the three item (i=1,2,3) 10 person (j = 1,..10) case would be specified something like

F1 by ITEM*i1 ITEM*i2 ITEM*i3

Where ITEM corresponds to the column vector of item responses and i(i) =1 when row_ji = i and 0 otherwise.

Is this also necessary in Mplus? Or is there another way to organize the data and specify the model? Specifically, I'm wondering if specifying the items as 'within' and using examinee id as the cluster variable in the usual 'wide' format is sufficient?

You should use wide format data where each of your three variables is one column of the data set. This becomes a single-level analysis because clustering is taken care of by multivariate analysis. See Example 5.5.
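If your data currently sit in the long layout described in the question, reshaping to wide is a simple pivot. A sketch with made-up responses (3 items, 2 examinees):

```python
# Long layout: one row per (person, item) pair, as in the question
long_rows = [
    (1, "i1", 1), (1, "i2", 0), (1, "i3", 1),
    (2, "i1", 0), (2, "i2", 0), (2, "i3", 1),
]
items = ["i1", "i2", "i3"]

# Pivot to wide: one row per examinee, one column per item,
# which is the layout Example 5.5 expects
by_person = {}
for person, item, resp in long_rows:
    by_person.setdefault(person, {})[item] = resp

wide = [[pid] + [by_person[pid][it] for it in items]
        for pid in sorted(by_person)]
print(wide)  # [[1, 1, 0, 1], [2, 0, 0, 1]]
```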

I want to test the difference between a constrained Rasch model and a free Rasch model, both with multiple groups of examinees, so I am wondering which estimator I should use: WLSM, WLSMV, or some other estimator? By the way, I tried to use the MLR estimator to estimate IRT parameters with multiple groups of examinees, but it didn't work. Could you please tell me the reason?

I am sorry to confuse you. Now, I need to adopt the Rasch model to estimate item difficulties and examinee abilities. For the difficulty estimation, I want to separate my examinees into four groups and then estimate item difficulties for each group. I want to use two methods to estimate each group's item difficulties. One is to constrain all groups' item difficulties to be equal, such as b11=b21=b31=b41, b12=b22=b32=b42 (the first subscript of b stands for group, the second subscript for item); the other method is to free one of the four groups' item difficulties while constraining the other three groups' item difficulties to be equal.

Now, what I want to know is which estimator I should use when estimating item difficulties with the two methods above: MLR, WLSM, WLSMV, or some other estimator?

My other question is whether we should use the same estimator when using Mplus to run the Rasch model many times from different perspectives, when all these analyses are for the same paper.

The Rasch model is usually estimated using maximum likelihood estimation. That would be MLR. You could also estimate the Rasch model using WLSM or WLSMV, in which case you would use the DIFFTEST option to compare models. Unless you are writing a paper that compares different estimators, I would stick with one estimator.

I am using Mplus to generate thetas for subjects who took a test with known item parameters (generated by BILOG-MG and published by the test developers). By running a Two-Parameter Logistic IRT Model and fixing the item loadings (from BILOG output) and item thresholds (computed as BILOG_Threshold*Loading), the model runs great and generates thetas that seem plausible. As long as the variance is fixed @1, the reported Item Difficulty parameters in the Mplus output are exactly the same as the BILOG_Threshold, but the Item Discrimination parameters in the Mplus output are about half the size of the BILOG_Slope parameters. I was hoping to use a match between the parameters (Mplus_Discrimination & BILOG_Slope; Mplus_Difficulty & BILOG_Threshold) as a check that I did it right. I'm thinking maybe I didn't. Any input would be greatly appreciated. Thank you.

Slide 94 of the Topic 2 handout may be helpful. For the discrimination there is also the D factor which some of the IRT programs set at 1.7. So if BILOG doesn't use D in its 2PL, you probably need to multiply the Mplus "a" by 1.7.
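As a side note on where the 1.7 comes from (a standard result, sketched here numerically): D is chosen so that the logistic curve closely tracks the normal ogive, which is why it appears when converting discriminations between the probit and logit metrics.

```python
import math

D = 1.7  # scaling constant relating the logit and probit metrics

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

# Largest gap between Phi(x) and logistic(D*x) on a grid: with
# D = 1.7 the two curves differ by roughly 0.01 at most
max_gap = max(abs(Phi(x) - logistic(D * x))
              for x in (i / 100.0 for i in range(-400, 401)))
print(max_gap)
```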

Bengt – Thank you for the prompt reply, your thoughts, and the reference to further resources. Discrimination*1.7 brings me closer to the BILOG slope, but still not on the money. To complicate things a bit more, when I allow the variance to be estimated freely, the "a"*1.7 is closer to the BILOG slope than when the variance is fixed @1, which is a little frustrating because when I fix the variance @1 the difficulty is spot on, but not so when free (note: fixing or freely estimating the factor mean has no effect either way). For example, the parameters for the first three items under each condition are as follows:

It seems to me that the spot-on "b" parameters with the fixed variance are preferable to the appreciably better match on the "a" and worsened match on the "b" found with the free variance. I would appreciate your thoughts. Maybe this is just an inappropriate use of Mplus (i.e., for scoring tests/theta generation rather than model fitting and parameter estimation); do you recommend a different approach? Thank you again for your time and insight.

I used Mplus version 3, in 2009. The program below was used to produce item characteristic curves. I ran this program today, but when I tried to view the graph it wasn't possible. I used Graph, View graph, which as far as I remember allows viewing, but it didn't seem to work. Advice is much appreciated! Here is the program code:

I have a few questions about my model. I have fitted an IRT model by fixing the factor variances to 1 and freeing the factor loadings, like this:

f1 BY u1* u2-u16;
f2 BY u17* u18-u32;
f3 BY u33* u34-u48;
f4 BY u49* u50-u64;
f5 BY u65* u66-u80;
f6 BY u81* u82-u96;
f7 BY u97* u98-u112;
f8 BY u113* u114-u128;
f9 BY u129* u130-u144;
h1 BY f1-f3;
h2 BY f4-f6;
h3 BY f7-f9;
p BY h1-h3;
f1-f9@1; h1-h3@1; p@1;

My question is: is this right, or must I also free the loadings of f1, f4, f7 and h1? Besides this, I want to look at the fit of this model. For that, I thought it wouldn't be necessary to fix the variances and free the loadings, because I am only interested in the fit indices. Is this right? I ask because the fit indices of the IRT model and the normal CFA model are very different, especially the CFI, TLI and WRMR. So which model should I choose to determine the fit of the model?

You need to free the first factor loadings of h1, h2, h3, and p if you fix the factor variances to one. Model fit will be the same whether you fix the first factor loadings to one or free all factor loadings and fix the factor variances to one. If you don't get the same fit with these two parameterizations, you are doing something wrong.

Hi, does Mplus do IRT simulation? In the simulation chapter of the user's guide I didn't see anything on this subject. Could you please refer me to an article that teaches IRT simulation step by step with Mplus, the same way your 2002 article in the Structural Equation Modeling journal taught SEM simulation?

Hi - I am trying to convert an analysis with several latent factors to an IRT framework, and I am trying to understand the conversion to IRT parameters (difficulty and discrimination). I have been reading and re-reading the web notes and discussion posts. You mention in Web Note 4 that an increased residual variance (theta) gives rise to a flatter conditional probability curve. But standardizing y* under the LRV formulation with theta = 1 - factor loading^2 * psi doesn't seem to take the residual variance into account at all? There is a footnote suggesting that the R^2 output can be used to estimate theta in the Delta parameterization. Does this only apply in multiple-group comparisons?

The extra residual variance parameters, which go beyond what is available in conventional IRT, are only relevant in multiple-group or multiple-timepoint settings. You need to fix them in a referent group, whereas they are free in the other groups.

I'm a new Mplus user, and I have two questions about IRT. The first is that I am confused about the transformation of the a and b parameter estimates. As I understand it, there are three formulas: (1) a = loading/sqrt(1 - loading**2); b = threshold/loading. (2) a = loading; b = threshold/loading. (3) a = loading/1.7; b = threshold/loading. When I use the ML estimation method, which formula is right? When the WLSMV method is used, which one is appropriate? Besides, do the 'loading' and 'threshold' in these formulas refer to the standardized or the unstandardized estimates? The second question is about interpreting the TECH10 output, especially the "Univariate Pearson Chi-Square" and "Univariate Log-Likelihood Chi-Square". How can I tell that an item misfits? I would appreciate suggestions for related references.

I have a question about the transformation between IRT and CCFA. Based on the technical appendix for IRT, using CCFA with a logit link, if we fix the factor variance to 1 and the factor mean to 0, we have discrimination a = loading lambda/1.7 and difficulty b = threshold tau/loading lambda. However, from other references (e.g., Wirth & Edwards, 2007; Kim & Yoon, 2011), the discrimination a = 1.7*(lambda/sqrt(1-lambda^2)) and difficulty b = threshold tau/loading lambda. Can you please explain the source of disagreement for the parameter a? Thank you very much for your help.

Thanks, Bengt; that makes a lot of sense. But based on slide 94, if we fix the factor variance and mean, using the logit link a = lambda/1.7; using the probit link, a = lambda/sqrt(1-lambda^2). It seems that with the probit link the scaling constant D should not be included in the equation; however, in others' notation a = 1.7*(lambda/sqrt(1-lambda^2)). Is this an error, or am I missing something? Thanks again for your help.

I agree that 1.7 does not belong with the expression using /sqrt(1-lambda^2) since that is probit (WLSMV with Delta parameterization). The constant 1.7 was introduced to make logit close to probit IRT estimates. Furthermore, these days it seems that 1.7 has been dropped from the logit expression.
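To summarize the exchange, the two conversions under discussion can be written side by side (assuming a single factor with mean 0 and variance 1, following the formulas quoted above):

```python
import math

def irt_from_logit(lam, tau):
    """ML (logit) loading/threshold -> IRT a, b.
    Without the 1.7 constant: a = lambda, b = tau / lambda.
    (Divide a by 1.7 if the target metric includes D.)"""
    return lam, tau / lam

def irt_from_probit_delta(lam, tau):
    """WLSMV, Delta parameterization (probit) -> IRT a, b:
    a = lambda / sqrt(1 - lambda^2), b = tau / lambda."""
    return lam / math.sqrt(1.0 - lam**2), tau / lam

a, b = irt_from_probit_delta(0.6, 1.2)
print(a, b)
```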

Instead of letting the "a" parameters float freely or fixing them to 1 (and thus obtaining a Rasch model), I would like to set specific values for the mean and standard deviation of "a" (e.g., mean 0.8 and std. dev. 0.3).

Does anyone have an idea for a script to obtain this? My attempts have all failed. I think I am dumb. >-;

Another question is whether it is possible to choose how to center the scale (either on b or on the theta scores, to have mean = 0 and std. dev. = 1).

I was wondering if you all had any recommendations about the best way to evaluate Rasch and 2PL IRT models when using the Bayes estimator. I see that the DIC and pD are not given. Thanks for any input!

Linda and Bengt, I jump in with a suggestion: if you're thinking of having a 3PL, you should explore the possibility of having a 4PL at the same time, as some studies have suggested that the upper asymptote may also be useful for modeling some kinds of items. Apart from some R scripts, almost no common IRT software package has the 4PL.

Julien, that's a great idea! Apparently, modeling the fourth parameter would not require much additional engineering effort compared to the 3PL and would indeed help.

Here I have students from all academic years taking a test at end-of-course level (intended for new graduates), so the upper asymptote is indeed crucial for obtaining a better fit.

I tried to use the R packages catR and irtProb; they can be useful for several tasks, but not for item parameter calibration. Once you have the parameters, you can then estimate standard errors, test information curves, etc.

What I have tried to explore for item calibrating so far, with limited success, was a script on WinBUGS (Loken and Rulison, 2010, see below).

If you guys have any alternative idea whatsoever for estimating the 4th parameter with less hassle, I would be most grateful.

My questions are: (1) I was trying to use MLR, but an error message appeared. Did I miss anything? (2) Although the overall model already states "f1 with f2", the covariances between f1 and f2 differ between the male and the female groups. Why? (3) Item difficulty estimates are biased more in the items of f2 and are very different from the true values. Did I do anything wrong in my syntax?

The height of the curves is lower in Mplus. The shapes are somewhat different too.

I double-checked, and the slopes/discrimination parameters are really close. The thresholds are of course different, as you explained in an earlier post. The numbers of free parameters, AIC, and BIC are also the same. I'm pretty sure the same things are being estimated.

In this model, the slope parameter (discrimination) Ajx equals 1 for all items (same as in the 1-parameter logistic), however the location (difficulty) parameter Bjx is divided into an item component (DELTAj) and a category, or step component (TAUx), so that Bjx = DELTAj + TAUx. It is used for polytomous items, and is a special case of the partial credit model (PCM): in the RSM the logits equal (THETA - DELTAj - TAUx), whilst in the PCM the logits equal (THETA - Bjx).
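As a numeric illustration of the model just described (this is the RSM's math only, not Mplus syntax; the parameter values are made up):

```python
import math

def rsm_probs(theta, delta_j, taus):
    """Rating scale model category probabilities for one item.
    Adjacent-category logits: theta - delta_j - tau_x."""
    sums = [0.0]  # cumulative logit sum; category 0 has sum 0
    for tau in taus:
        sums.append(sums[-1] + (theta - delta_j - tau))
    exps = [math.exp(s) for s in sums]
    total = sum(exps)
    return [e / total for e in exps]

# Three categories (two steps), item location 0.2, steps -0.6 and 0.6
probs = rsm_probs(theta=0.5, delta_j=0.2, taus=[-0.6, 0.6])
print([round(p, 3) for p in probs])
```

Setting all item locations equal to a common set of taus plus an item-specific delta is what distinguishes the RSM from the more general PCM.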

Please, could you suggest how to translate this model into an Mplus syntax, if possible?

I haven't looked into this, but I wonder if one could use the Mplus "nu" parameter to capture DELTAj. I assume TAUx implies that x varies over the different categories of the item, so they are the Mplus threshold parameters.

With categorical items, the "nu" parameters are not activated (since they wouldn't all be identified together with the thresholds), but you essentially get them by placing a factor behind each item and letting that factor's mean (alpha) pick up the nu.

Item M9 was answered by 0.8% (i.e., n=3) of the 375 students:

M9  Category 1  0.992  372.000
    Category 2  0.008  3.000

The item discrimination was

F BY M9  -0.324  0.457  -0.708  0.479

and the item difficulty was

M9$1  -15.062  20.868  -0.722  0.470

This means the item is very easy. However, the item was answered by less than 1% of the students, so having such a low difficulty estimate makes no sense. I checked this in both version 6 and version 7. We have run the same data in RUMM2010 (1PL Rasch) and ICL (3PL) and get more plausible location values for this item. My concern is whether there is an error in the code or the setup, or whether it is possible to have such a low logit despite such high difficulty. Advice appreciated.

I've estimated a bifactor mimic model and want to convert the parameter estimates into IRT terms. I'd like to confirm that I have the generalization of equations (4) and (5) given by MacIntosh & Hashim (2003) to the bifactor case correct. I'm using the WLSMV estimator with delta parameterization, and I followed the steps described by MacIntosh and Hashim to set the means and variances of the latent variables to 0 and 1. Specifically, I centered the values of the dichotomous covariates about their means and set the residual variances of the latents to values estimated from an initial run.

(1) Would the correct equation for the discrimination of item j on the general factor be: a_j = lambda_jGen / sqrt(1 - lambda_jGen**2*psi_Gen - lambda_jLoc**2*psi_Loc), where psi_Gen and psi_Loc refer to the residual variances for the general and a local factor, respectively?

(2) Would the b-value for item j in group k on the general factor be: b_jk = (tau_j - beta_j*z_k) / lambda_jGen, where beta_j is the direct effect of a covariate on item j and z is the group indicator dummy variable?

and (3) is the dummy variable z appropriately coded 0 and 1 for the reference and focal groups, respectively, or should it be coded with the deviation scores from the mean, given that the covariate was centered for model estimation?
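For concreteness, here are the poster's proposed equations (1) and (2) coded up exactly as stated above; to be clear, these are the formulas being asked about, not formulas confirmed in this thread:

```python
import math

def a_general(lam_gen, lam_loc, psi_gen=1.0, psi_loc=1.0):
    """Proposed equation (1): general-factor discrimination in a
    bifactor model, Delta parameterization."""
    theta = 1.0 - lam_gen**2 * psi_gen - lam_loc**2 * psi_loc
    return lam_gen / math.sqrt(theta)

def b_general(tau, beta, z, lam_gen):
    """Proposed equation (2): b-value with covariate direct effect
    beta and group indicator z."""
    return (tau - beta * z) / lam_gen

# Sanity check: with no local loading and no covariate effect these
# reduce to the familiar single-factor formulas
# a = lam/sqrt(1 - lam^2) and b = tau/lam
a0 = a_general(0.6, 0.0)
b0 = b_general(1.0, 0.0, 0.0, 0.5)
```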

I want to run a 2-group 1PL IRT model using Mplus. That is, I want to constrain the discrimination parameters within and across groups but do not want to constrain the difficulty parameters across groups. When I ran the code below, I got the error message shown below. Could you tell me what is wrong in my code? Without [u1-u25$1], I got equal difficulty parameters across groups.

I have a quick question regarding discrimination values in a 2PL IRT model. I have read that these values can theoretically take any value, yet in practice they typically fall between 0.50 and 2.50. I have fitted some models in which I get discrimination (a) parameters such as 36.26 and other large values. However, these models converge and there are no warnings. Is such a value simply impossible, or, if the item is infrequently endorsed (binary), is such a high discrimination value possible?

I apologize if the answer to my question is in this discussion board somewhere, but it seemed most appropriate to ask here.

I have a bi-factor model with ordered categorical items - 1 general factor and 2 subfactors - that I'm trying to estimate. Can someone tell me the difference between the unstandardized and standardized coefficients when requesting MLR versus WLSMV? With WLSMV, the unstandardized and standardized coefficients are the same and look like factor loadings to me. If I use MLR, it looks like I'm getting discrimination and threshold parameters from an IRT model. Would this be a correct assessment?

I have a question about the IRT PARAMETERIZATION IN TWO-PARAMETER PROBIT METRIC WHERE THE PROBIT IS DISCRIMINATION*(THETA - DIFFICULTY) portion of the output.

I have two groups and for now I am using the theta parameterization and have fixed the residual variances at one in both groups. I have constrained the thresholds and the factor loadings to be invariant over groups and the factor means and variances are free.

I notice that in the IRT PARAMETERIZATION part of the output, that the item discriminations and item difficulties are not the same across groups. This confuses me. Above, Dr. Muthen mentions: "To summarize my view, there are two ways to capture DIF in Mplus modeling: (1) CFA with covariates and (2) multi-group analysis. To me, DIF means that for a given item you have different item characteristics curves for different subject groupings and both approaches capture this. "

Don't the differences between groups in item discriminations and difficulties given in the IRT parameterization portion of the output suggest that the ICCs would be different? But the factor loadings and thresholds are fixed to be invariant? Am I testing the invariance of the wrong parameters? Should I be testing the invariance of the parameters in the IRT parameterization if I am interested in dif?

Our Topic 2 handout, slide 94 shows how the default Mplus factor model parameterization using thresholds and loadings relate to the IRT parameterization with a and b. This shows that even with invariant thresholds and loadings, a and b will vary across groups when the factor mean and/or factor variance vary across groups.

Note that the IRT parameterization output refers to a factor with mean zero and variance 1.

My writing has been about DIF using the factor model parameterization.
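The algebra behind this answer can be sketched as follows (theta parameterization with residual variance 1 assumed; the group values are made up). With an invariant loading lambda and threshold tau, a group whose factor has mean alpha and variance psi gives, on a standardized-factor metric, a = lambda*sqrt(psi) and b = (tau - lambda*alpha)/(lambda*sqrt(psi)):

```python
import math

def irt_for_group(lam, tau, alpha, psi):
    """a and b on a standardized-factor metric for a group whose
    factor has mean alpha and variance psi:
      probit = lam*(alpha + sqrt(psi)*z) - tau = a*(z - b),
    so a = lam*sqrt(psi) and b = (tau - lam*alpha)/a."""
    a = lam * math.sqrt(psi)
    b = (tau - lam * alpha) / a
    return a, b

lam, tau = 0.8, 0.4  # invariant across groups (made-up values)
ref = irt_for_group(lam, tau, alpha=0.0, psi=1.0)
foc = irt_for_group(lam, tau, alpha=0.5, psi=1.5)
# Same loading and threshold, yet a and b differ by group
print(ref, foc)
```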

Thank you for your response and I would like to ask a follow-up question if it is not too much trouble. In the IRT literature it is not uncommon to hear that an item is unbiased between groups A and B if and only if the two item characteristic curves are identical. I think Mellenbergh more generally states this as: An item is unbiased with respect to the variable G and given the variable Z if and only if f(X|g,z) = f(X|z) for all values g and z of the variables G and Z, where f(X|g,z) is the distribution of the item response given g and z and f(X|z) the distribution of the item responses given z; otherwise the item is biased.

I am wondering if you are thinking about dif differently, or if invariant factor loadings and thresholds in CFA with dichotomous items, with factor means and variances free, is consistent with the above definition.

I am trying to estimate IRT parameters for a large bank of cognitive ability items. I have about 600 items total with about 200 items loading on one of three correlated factors. My data is rather sparse because I had my trial sample take 36 items randomly drawn from the full item bank. My sample had over 12,000 people, so each item was seen by at least 300 people.

The data runs just fine as a unidimensional model, but as a multidimensional model, it has been running for 7 days.

Do you have any idea as to why the estimation is taking so long or suggestions as to what I can do to reach a valid solution?

If you ask for TECH8 you will see screen printing of the iterations. The screen printing will tell you how long each iteration takes and how fast or slow the convergence is, so you can get an idea of the total time required. I assume your items are declared as categorical, so that numerical integration is called for. The screen should show 3 dimensions of integration, which with the default of 15 points per dimension gives over 3,000 quadrature points (15^3 = 3,375) and can take a while for n = 12,000 (time is linear in n).

But it shouldn't be bad for a fast computer with, say, PROCESSORS = 8 and an i7 CPU.

To speed it up you can use INTEGRATION = 10. Or you can say NOSERR in the OUTPUT command to not compute SEs (they take a while with so many parameters). Or you can use ESTIMATOR = BAYES. Or you can take a smaller random sample of your data and use those estimates as starting values for the full-sample run.
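A back-of-envelope sketch of why the point reduction helps, assuming a product quadrature grid and per-iteration work roughly proportional to grid size times sample size:

```python
# Rough cost model for ML with numerical integration: the likelihood is
# evaluated over a product quadrature grid, so the number of points grows
# as (points per dimension) ** (number of factors), and per-iteration work
# scales with grid size * sample size.
def grid_points(points_per_dim, n_factors):
    return points_per_dim ** n_factors

n = 12_000
default = grid_points(15, 3)   # Mplus default: 15 points per dimension
reduced = grid_points(10, 3)   # with INTEGRATION = 10
print(default, reduced)        # 3375 vs 1000 points
print(default * n, reduced * n)  # relative per-iteration work
print(reduced / default)       # ~0.30: roughly a 3x speed-up
```

So dropping from 15 to 10 points per dimension cuts the grid by about two thirds, at some cost in integration accuracy.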

After having read through this thread in depth again, a question came to mind. Apologies if the answer is either obvious or already somewhere here or on SEMNET, which I also searched.

Why is it the case that for binary indicators in CFA it is generally recommended to avoid ML or MLR (because of the non-normality of discrete/binary indicators), whereas in the IRT setup ML(R) estimation is (based on my reading) a somewhat better choice than limited-information estimators? I understand the advantage of full-information estimation, but I am missing something as to why full-information methods would not simply be favored regardless of the metric of your indicators.

I think your question reflects a common misunderstanding that I have seen several times also in the literature, so I am glad you ask it to help clear it up.

When it is said that ML should be avoided with binary indicators in CFA, the speaker/writer is thinking of ML as synonymous with an analysis that treats the indicators as continuous-normal outcomes. So the mistake is to think that ML implies continuous variables. Treating binary variables as continuous is of course not optimal, but ML does not imply analyzing your variables as if they were continuous. The fact that ML (or MLR) can be used for variables other than continuous ones is clear from a study of logistic regression, from IRT (which is logistic regression with factors as IVs), and from regression with a count DV.
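To make the distinction concrete, here is a minimal sketch (hypothetical item parameters, not from any of the programs discussed) of what ML actually maximizes for binary data: a Bernoulli log-likelihood with a logistic link, as in logistic regression and the 2PL model, rather than a normal density:

```python
from math import exp, log

def logistic_2pl(theta, a, b):
    """Item response probability under a 2PL logistic model
    (IRT = logistic regression with the factor as predictor)."""
    return 1.0 / (1.0 + exp(-a * (theta - b)))

def bernoulli_loglik(u, p):
    """What ML maximizes for a binary outcome: each response
    contributes log p or log(1 - p), never a normal density."""
    return u * log(p) + (1 - u) * log(1 - p)

# One person (theta = 0.5) answering three items (u = 1, 0, 1):
responses = [1, 0, 1]
items = [(1.2, -0.5), (0.8, 0.0), (1.5, 1.0)]  # hypothetical (a, b) pairs
ll = sum(bernoulli_loglik(u, logistic_2pl(0.5, a, b))
         for u, (a, b) in zip(responses, items))
print(round(ll, 4))
```

Nothing in this likelihood assumes the observed responses are continuous or normal; the "avoid ML" advice only applies to ML under a continuous-normal model for the indicators.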

Apologies if the answer to this question is too obvious or somewhere in the forum.

I am trying to run a multidimensional IRT with nominal indicators (two latent continuous factors with nominal indicators, and several indicators have crossing "loadings"). I am just wondering if Mplus can handle it? I know Mplus has the NOMINAL option, but I am not sure if I can declare the observed indicators as NOMINAL in a measurement model. Thank you.