Abstract This study presents a case study applying mixed-effects ordered probit models to use scores from an automated scoring engine (AE) to monitor and provide diagnostic feedback to human raters in training. Using data from an experimental rater training study, we illustrate a statistical approach for analyzing three types of model-based rater effects: the severity, accuracy, and centrality of each rater. Each rater effect is related to model parameters and compared for cases in which (a) the AE is treated as the gold standard and (b) the human expert (HE) is treated as the gold standard. Results showed that the AE and HE scoring approaches agreed perfectly (100%) in detecting severity. The agreement rate was somewhat lower for centrality (93.1%) and considerably lower for accuracy (66.4%). As a targeted case study, this examination concludes with practical implications and cautions for rater monitoring based on the AE.
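The ordered probit machinery behind this abstract can be sketched briefly. In an ordered probit model, a latent continuous score is cut into ordered categories by increasing thresholds, and a rater severity effect shifts the latent score downward. The sketch below is a minimal stdlib-Python illustration of that mechanism; the linear predictor, thresholds, and severity shift are made-up values, not the study's fitted model.

```python
from math import erf, sqrt

def norm_cdf(x):
    # Standard normal CDF via the error function.
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def ordered_probit_probs(eta, cutpoints):
    """Category probabilities under an ordered probit model.

    eta       : latent linear predictor (e.g., examinee ability minus a
                rater severity effect; illustrative, not the study's model)
    cutpoints : increasing thresholds c_1 < ... < c_{K-1} dividing the
                latent continuum into K ordered score categories
    """
    cdf = [norm_cdf(c - eta) for c in cutpoints]
    probs = [cdf[0]]
    probs += [cdf[k] - cdf[k - 1] for k in range(1, len(cdf))]
    probs.append(1.0 - cdf[-1])
    return probs

# A severe rater (severity lowers eta) shifts probability mass toward
# the low score categories relative to a neutral rater.
neutral = ordered_probit_probs(0.0, [-1.0, 0.0, 1.0])
severe = ordered_probit_probs(-0.5, [-1.0, 0.0, 1.0])
```

In this framing, a severity effect appears as a shift in `eta`, a centrality effect as compressed cutpoints, and accuracy as agreement with the gold-standard score, which is one way to read the three rater effects the abstract names.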

Abstract The Watanabe-Akaike information criterion (WAIC; Watanabe, 2010) and leave-one-out cross-validation (LOO) are two fully Bayesian model selection methods that have been shown to outperform traditional information-criterion-based methods such as AIC, BIC, and DIC in the context of dichotomous IRT model selection. In this paper, we investigated whether the superior performance of WAIC and LOO generalizes to polytomous IRT model selection. Specifically, we conducted a simulation study comparing the statistical power of WAIC and LOO with that of AIC, BIC, AICc, SABIC, and DIC in selecting the optimal model from a group of polytomous IRT models. We also used a real data set to demonstrate the use of LOO and WAIC for polytomous IRT model selection. The findings suggest that while all seven methods have excellent statistical power (greater than 0.93) to identify the true polytomous IRT model, WAIC and LOO appear to have slightly lower power than DIC, whose performance is in turn marginally inferior to that of AIC, BIC, AICc, and SABIC.
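The WAIC the abstract compares against other criteria has a simple pointwise form: WAIC = -2(lppd - p_WAIC), where lppd sums the log of the posterior-mean likelihood over observations and p_WAIC sums the across-draw variance of the pointwise log-likelihoods. The sketch below is a minimal stdlib-Python illustration of that standard formula (deviance scale, lower is better), not the authors' simulation code:

```python
from math import exp, log
from statistics import variance

def waic(log_lik):
    """WAIC (Watanabe, 2010) from pointwise log-likelihoods.

    log_lik : S x N matrix (S posterior draws, N observations) of
              log p(y_i | theta_s). Returns WAIC on the deviance scale.
    """
    S, N = len(log_lik), len(log_lik[0])
    lppd = 0.0     # log pointwise predictive density
    p_waic = 0.0   # effective number of parameters
    for i in range(N):
        draws = [log_lik[s][i] for s in range(S)]
        lppd += log(sum(exp(d) for d in draws) / S)
        p_waic += variance(draws)  # sample variance over draws
    return -2.0 * (lppd - p_waic)

# Toy 2-draw, 2-observation matrix, purely for illustration.
toy = waic([[-1.0, -2.0], [-1.2, -1.8]])
```

In practice, exponentiating log-likelihoods this way can underflow; production implementations use a log-sum-exp trick, and LOO is typically computed from the same matrix via Pareto-smoothed importance sampling.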

Abstract In this study, a simulation-based method for computing joint maximum likelihood estimates of the reduced reparameterized unified model parameters is proposed. The central theme of the approach is to reduce the complexity of models to focus on their most critical elements. In particular, an approach analogous to joint maximum likelihood estimation is taken, and the latent attribute vectors are regarded as structural parameters rather than parameters to be removed by integration. With this approach, the joint distribution of the latent attributes does not have to be specified, which reduces the number of parameters in the model.

Abstract Situational judgment tests (SJTs) show useful levels of validity as predictors of job performance. However, scoring SJTs is challenging. We proposed nominal response model (NRM)-based scoring methods for SJTs. Using real data from an SJT, we illustrated how to set up the NRM-based scoring rules and their rationales, how to examine dimensionality and reliability, and how to evaluate item-, measurement-, and score-invariance across subgroups at different time points. We also compared the NRM-based scores with other commonly used scoring approaches in terms of their relationships with relevant external variables for the studied SJT.
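The nominal response model underlying this scoring approach assigns each response option its own slope and intercept, with option probabilities given by a softmax over the linear predictors. The sketch below is a minimal stdlib-Python illustration of Bock's NRM probability function; the slope and intercept values are invented for illustration and are not parameters from the studied SJT.

```python
from math import exp

def nrm_probs(theta, slopes, intercepts):
    """Option response probabilities under the nominal response model:
    P(k | theta) proportional to exp(a_k * theta + c_k).

    slopes/intercepts are per-option parameters (illustrative values
    below are made up, not estimates from the study's SJT data).
    """
    z = [a * theta + c for a, c in zip(slopes, intercepts)]
    m = max(z)  # subtract the max for numerical stability
    w = [exp(v - m) for v in z]
    total = sum(w)
    return [v / total for v in w]

# Options with larger slopes become relatively more attractive as theta
# rises; this ordering of slopes is what lets NRM-based scoring rank
# SJT response options without prespecified keys.
low = nrm_probs(-1.0, [0.0, 0.5, 1.0], [0.0, 0.0, 0.0])
high = nrm_probs(1.0, [0.0, 0.5, 1.0], [0.0, 0.0, 0.0])
```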

Abstract Differential item functioning (DIF) occurs when individuals of the same true latent ability or psychological trait from different demographic populations are found to have different chances of endorsing an item category. The ability to identify such items depends on many factors, including the sample size of each demographic group, the average true latent trait score in each group, the chosen DIF assessment method, the magnitude of the DIF effect, and the quality of the anchor set. An anchor is a group of items free of DIF that establishes a common metric between groups. If the anchor is contaminated, that is, if it contains a DIF item, the common metric is inappropriate. The current literature rarely addresses the relationship between item parameters, anchor selection, and subsequent DIF detection. In this two-part study, we show that the power of DIF detection is high when the anchor contains highly discriminating items. Additionally, DIF items with large discrimination and moderate difficulty generally yield high detection power under a correctly specified anchor, given a fixed DIF effect size. Implications for anchor selection and DIF effect size research are discussed.
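One widely used DIF assessment method of the kind the abstract mentions is the Mantel-Haenszel procedure, which compares the odds of a correct response between groups within matched ability strata. The sketch below is a minimal stdlib-Python illustration of the MH common odds ratio on invented counts; it is one common DIF method, not necessarily the one used in this study.

```python
def mh_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio across ability strata.

    Each stratum is a tuple (A, B, C, D) of counts:
      A = reference group correct,  B = reference group incorrect,
      C = focal group correct,      D = focal group incorrect.
    A value near 1 suggests no uniform DIF; values far from 1 suggest
    the item favors one group after matching on ability.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Invented counts: both groups perform alike within strata -> ratio ~ 1.
no_dif = mh_odds_ratio([(30, 10, 30, 10), (20, 20, 20, 20)])

# Invented counts: reference group does better within strata -> ratio > 1.
dif = mh_odds_ratio([(30, 10, 20, 20), (20, 20, 10, 30)])
```

The matching variable here is the total score on the anchor items, which is exactly why a contaminated anchor distorts the comparison: DIF in the anchor corrupts the strata themselves.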