Dissertation/Thesis Abstract

A Comparative Study of IRT Models for Rater Effects and Double Scoring

by Song, Yoon Ah, Ph.D., The University of Iowa, 2019, 168; 22617490

Abstract (Summary)

In dealing with rater effects, double scoring is a popular method for controlling the quality of ratings on tests that include constructed-response (CR) items. Treating the multiple ratings of a response as independent item scores violates the local independence assumption of item response theory (IRT). The typical way to fit standard IRT models to multiple ratings is to use a linear combination of the ratings, such as the sum or average, as the item score. However, summed or averaged scores have limitations: they require adjusting the original item score categories, and the resulting item scores still contain rater effects. The purpose of this dissertation is to assess the effectiveness of using double ratings rather than single ratings in standard IRT models when rater effects are present, and to compare the performance of standard IRT models with that of newer IRT models for rater effects and multiple ratings, which are known to correct for rater effects in parameter estimation and to preserve the original item score categories.
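The category adjustment mentioned above can be illustrated with a small sketch. It is not taken from the dissertation; the function names and the 0-based scoring convention are assumptions made only for illustration. With two ratings on categories 0 to K-1, the summed score ranges over 0 to 2(K-1), and the averaged score can take half-point values that do not exist on the original scale.

```python
def combine_double_ratings(r1, r2, n_categories):
    """Return (summed, averaged) item scores for two ratings.

    Assumes each rating is an integer in 0..n_categories-1
    (a hypothetical convention chosen for this sketch).
    """
    for r in (r1, r2):
        if not 0 <= r < n_categories:
            raise ValueError("rating outside the original score categories")
    total = r1 + r2
    return total, total / 2


def summed_score_categories(n_categories):
    """List the score categories a summed double rating can take."""
    return list(range(2 * (n_categories - 1) + 1))


# A 4-category item (0..3) rated 2 and 3 by two raters.
summed, averaged = combine_double_ratings(2, 3, 4)
print(summed, averaged)
print(summed_score_categories(4))
```

In this example the summed score needs seven categories instead of the original four, and the average of 2.5 falls between two original categories. Neither combination removes a rater's leniency or severity from the item score.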

Two simulation studies examined the accuracy of the IRT models, with the number of ratings and the IRT model as the main factors. The number of ratings was either single or double. The two IRT models were the generalized partial credit model (GPCM) and the hierarchical rater model (HRM), representing a standard IRT model and an IRT model for multiple ratings and rater effects, respectively. The HRM was used to generate ratings with rater effects, and the GPCM and HRM were then fitted to those ratings. Ratings were generated under combinations of the other study factors: sample size, test length, rater effects, and number of score categories. Results were compared and interpreted relative to baseline conditions, in which ratings were generated with no rater effects.
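The data-generation step can be sketched as follows. This is a minimal illustration and not the dissertation's simulation code. It assumes the common two-stage form of the HRM: an unobserved "ideal" rating is drawn from the GPCM, and each rater's observed rating is then drawn with probability proportional to a normal kernel centered at the ideal rating plus a rater bias. The function names, parameter names (`bias`, `spread`), and example values are hypothetical.

```python
import math
import random


def gpcm_probs(theta, a, steps):
    """GPCM probabilities for categories 0..len(steps).

    theta: proficiency; a: item discrimination; steps: step parameters.
    """
    # Cumulative sums of a * (theta - b_v); category 0 has an empty sum of 0.
    logits = [0.0]
    for b in steps:
        logits.append(logits[-1] + a * (theta - b))
    top = max(logits)  # subtract the maximum for numerical stability
    weights = [math.exp(z - top) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


def rater_probs(ideal, n_categories, bias, spread):
    """Observed-rating probabilities given the ideal rating (assumed form).

    Proportional to exp(-(k - (ideal + bias))^2 / (2 * spread^2)).
    A positive bias shifts ratings upward; a larger spread means a less
    consistent rater.
    """
    center = ideal + bias
    weights = [math.exp(-((k - center) ** 2) / (2 * spread ** 2))
               for k in range(n_categories)]
    total = sum(weights)
    return [w / total for w in weights]


def simulate_ratings(theta, a, steps, raters, rng):
    """Draw one ideal rating, then one observed rating per (bias, spread)."""
    probs = gpcm_probs(theta, a, steps)
    categories = list(range(len(probs)))
    ideal = rng.choices(categories, weights=probs)[0]
    return [
        rng.choices(categories,
                    weights=rater_probs(ideal, len(probs), bias, spread))[0]
        for bias, spread in raters
    ]


rng = random.Random(0)
# Double scoring: one lenient rater and one severe rater (made-up values).
print(simulate_ratings(0.5, 1.2, [-1.0, 0.0, 1.0], [(0.5, 0.5), (-0.5, 0.5)], rng))
```

Setting every rater's bias to 0 and using a very small spread makes the observed ratings equal the ideal rating almost surely, which corresponds to a baseline condition with no rater effects.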

The main findings of this dissertation were as follows: (1) using single ratings as item scores in rater effect conditions reduced the accuracy of proficiency estimation in the GPCM; (2) double scoring methods relieved the impact of rater effects on proficiency estimation and improved accuracy in the GPCM; (3) for double ratings, the HRM performed better than the GPCM using summed item scores; (4) in general, the accuracy of proficiency estimation improved as more items and a larger number of score categories were used.
