Outline

Aim of the study: The influence of examiners ("rater bias") is a serious threat to the objectivity of assessments, e.g. in oral exams. The purposes of this study were to analyze different types of rater effects in an OSCE in internal medicine at the University of Heidelberg, to establish a standard procedure for controlling rater effects, and to investigate their influence on the reliability of the OSCE. We considered three types of rater effects: differences in mean scores, differences in score variance, and differences in discriminative power between examiners.

Methods: The OSCE "Internal Medicine" consisted of 12 stations. The examinations were carried out in 6 sessions on two days. 139 students participated in the examination, and a total of 39 examiners took part, nine of them on both days, the others on one day only. Students were assigned randomly to the sessions. To test for the rater effects described above, we compared the score distributions obtained at parallel stations and the correlations of scores with corrected total scores between examiners (Kruskal-Wallis test for location, Fligner-Killeen test for dispersion, and the asymptotic Fisher z test for rank correlations). A generalizability analysis was used to estimate the influence of rater effects on the reliability of the OSCE.
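The three per-station comparisons named above can be sketched with standard scipy routines. This is a minimal illustration, not the study's analysis code: the examiner groupings, score values, and correlations below are hypothetical, and the Fisher z helper is an assumed implementation of the usual asymptotic two-sample comparison of correlations.

```python
# Hypothetical sketch of the per-station rater-effect tests; all data are
# made-up illustration values, not the study's scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Scores at one station, grouped by the three examiners who rated it
scores_by_examiner = [rng.normal(70, 8, 45),
                      rng.normal(74, 5, 47),
                      rng.normal(70, 8, 47)]

# Location: Kruskal-Wallis test for differences in score level between examiners
h_stat, p_location = stats.kruskal(*scores_by_examiner)

# Dispersion: Fligner-Killeen test for differences in score variance
f_stat, p_dispersion = stats.fligner(*scores_by_examiner)

def fisher_z_compare(r1, n1, r2, n2):
    """Asymptotic Fisher z test comparing two independent correlations
    (e.g. two examiners' correlations with the corrected total score)."""
    z1, z2 = np.arctanh(r1), np.arctanh(r2)          # Fisher z transform
    se = np.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))    # standard error of z1 - z2
    z = (z1 - z2) / se
    return 2 * stats.norm.sf(abs(z))                 # two-sided p-value

# Discrimination: compare two examiners' (hypothetical) correlations
p_correlation = fisher_z_compare(0.62, 45, 0.35, 47)
```

Each test is run per station; a small p-value flags examiners whose score level, spread, or discriminative power deviates from their peers at that station.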

Results: All three rater effects were detected in the OSCE. We found significant differences in mean scores and in score variances between examiners at 9 and 5 of the 12 OSCE stations, respectively. Significant differences in the correlations with corrected total scores (discriminative power of raters) were found at 4 stations. However, the influence on the reliability of the examination was only moderate (dependability coefficient Φ = 0.825; estimated dependability without rater effects Φc = 0.853).
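The dependability coefficient Φ reported above comes from generalizability theory: true (person) score variance divided by person variance plus absolute error variance. A minimal sketch of that ratio for a simple persons-by-stations design follows; the variance components are invented illustration values, not the study's estimates, and the real design additionally involves rater facets.

```python
# Hypothetical sketch of a dependability coefficient (phi) for a p x s design;
# variance components below are made-up, not the study's estimates.
def dependability(var_person, var_station, var_interaction, n_stations):
    # Absolute error variance: station and residual components, averaged
    # over the number of stations the score is based on
    abs_error = (var_station + var_interaction) / n_stations
    return var_person / (var_person + abs_error)

phi = dependability(var_person=4.0, var_station=2.0,
                    var_interaction=6.0, n_stations=12)
# With these illustrative components: 4.0 / (4.0 + 8.0/12) ≈ 0.857
```

Re-estimating Φ with the rater-related variance components removed, as done for Φc above, shows how much reliability the rater effects actually cost.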

Conclusion: To ensure the quality of an OSCE, all types of rater bias have to be controlled, so statistical analyses of examinations are always necessary (classical test theory alone is not sufficient). Appropriate measures have to be taken to safeguard the quality of the assessment, e.g. improved station materials and targeted examiner training. Nevertheless, an OSCE with a sufficient number of stations can be a reliable and robust examination even in the presence of isolated strong rater effects [1].