The purpose of this study was to investigate the effects of immediate item feedback (knowledge of results) on the reliability and validity of total test scores. Two types of feedback were studied: partial feedback (knowledge of correctness obtained by means of one attempt per item) and full feedback (knowledge of the correct response obtained by means of one attempt per item). Total feedback, or knowledge of the correct response obtained by answering until correct, was not involved.
Much of the previously published research on immediate item feedback appeared to need larger samples, and many designs did not appear capable of isolating the effects of feedback on mean test scores or on the reliability and validity coefficients of tests administered under feedback conditions. The results of those studies were possibly confounded by the use of different response devices, time limits, numbers of attempts per item, and scoring strategies in the treatment and control groups.
Nine junior high schools in a large urban-suburban school district in the southeastern United States were selected using a stratified random sampling procedure. Ninth-grade students were assigned to cells in a 3 x 3 treatment-by-ability design and were tested on an adapted version of the SCAT-3B Verbal using Trainer-Tester response devices. Total scores of 2,023 students were analyzed with a non-orthogonal ANOVA procedure and Scheffé comparisons. KR-20 reliability coefficients were analyzed using a k-sample test developed by Hakstian and Whalen (1976), and validity correlations with a subsequent reading achievement measure were analyzed with the usual tests for Pearson correlations.
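The two coefficients named above can be illustrated with a minimal sketch. It assumes the standard KR-20 formula for dichotomously scored items (using population variances) and, for the validity correlations, a pairwise Fisher z comparison of two independent Pearson correlations. The abstract says only that "the usual tests" were applied, so the Fisher z choice is an assumption, and the Hakstian and Whalen k-sample test is not reproduced here. All function names and data are hypothetical.

```python
import math
from statistics import pvariance


def kr20(responses):
    """KR-20 reliability for a matrix of 0/1 item scores.

    responses: list of examinee rows, each a list of 0/1 item scores.
    Uses population variance so that the total-score variance is on
    the same footing as the item variances p * (1 - p).
    """
    n_people = len(responses)
    n_items = len(responses[0])
    if n_items < 2:
        raise ValueError("KR-20 needs at least two items")
    totals = [sum(row) for row in responses]
    total_var = pvariance(totals)
    if total_var == 0:
        raise ValueError("total scores have zero variance")
    # Sum of item variances p_j * q_j, where p_j is the proportion correct.
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_people
        pq_sum += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - pq_sum / total_var)


def fisher_z_difference(r1, n1, r2, n2):
    """z statistic for the difference between two independent correlations.

    Assumption: Fisher's r-to-z transformation, a common choice for
    comparing Pearson correlations from independent groups.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    return (z1 - z2) / se


# Hypothetical data: five examinees, four items (not from the study).
example = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
example_kr20 = kr20(example)
```

A positive value from `fisher_z_difference` indicates that the first group's correlation is the larger one; it would be referred to the standard normal distribution.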
Statistically significant main effects were found for treatment and ability, and the interaction was also significant. Examination of simple main effects indicated consistently lower means for the nonfeedback groups across ability levels; except for a reversal within the low ability level, full feedback means were generally lower than those for partial feedback. Differences among the reliability coefficients of the three treatment groups were statistically significant (partial feedback was greater than no feedback, which was greater than full feedback), while the validity coefficients for partial and no feedback were significantly greater than that obtained for full feedback.
While a wealth of statistically significant findings was obtained, many of the significant differences were small. Criteria for judging educational, or practical, significance were discussed in terms of effect sizes (Cohen, 1969) and increased test length. After the suggested criteria were adopted, only one finding was judged to be educationally significant: for low ability students, there was a substantial increment in mean verbal ability scores in favor of full feedback over no feedback. Otherwise, this study failed to show any substantial benefit or harm to students from receiving knowledge of results while taking tests similar to those used in the study. The relevance of the study to previous research and suggestions for further research were also discussed.
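The two yardsticks for practical significance mentioned above can be sketched as follows. The sketch assumes Cohen's d computed with a pooled standard deviation, and the Spearman-Brown formula for expressing a reliability difference as an equivalent change in test length. The abstract does not state which formulas or thresholds were adopted, so the functions and numbers below are hypothetical illustrations only.

```python
import math


def cohens_d(mean1, sd1, n1, mean2, sd2, n2):
    """Standardized mean difference using a pooled standard deviation."""
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2)
    return (mean1 - mean2) / math.sqrt(pooled_var)


def length_factor(rel_from, rel_to):
    """Spearman-Brown factor by which a test must be lengthened
    for its reliability to rise from rel_from to rel_to."""
    return (rel_to * (1 - rel_from)) / (rel_from * (1 - rel_to))


# Hypothetical numbers, not taken from the study.
d = cohens_d(12.0, 4.0, 50, 10.0, 4.0, 50)
k = length_factor(0.80, 0.90)
```

Read this way, a small reliability difference between treatment groups corresponds to a length factor close to 1, which is one way to argue that a statistically significant difference has little practical weight.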