Summary: Taking an approach similar to that of James (2006), the subject of Monday’s blog post, Willoughby and Gustafson investigate the impact of grading incentives on student participation in astronomy courses by analyzing audio-recordings of students’ small-group discussions of clicker questions. They improve on James’ study by comparing high-stakes and low-stakes grading schemes within different sections of the same course taught by the same instructor, whereas James considered two different courses, each with a different instructor. Another difference is that James studied discussions between pairs of students; Willoughby and Gustafson looked at discussions among groups of four students.

The course in this study was an introductory astronomy course for non-majors taught by one of the authors of the study. Four sections of the course were studied, two in the spring 2007 semester and two in the fall 2007 semester. In each of the four sections, clicker scores contributed 4% of the students’ course grades. In the two “high-stakes” sections, correct answers to clicker questions were worth one point each; incorrect answers worth nothing. In the “low-stakes” sections, all answers to clicker questions (correct or not) were worth one point each.

Audio-recordings of student small-group discussions were collected only in the spring 2007 semester. In that semester, student groups in the high-stakes section voted as a block in response to clicker questions 69% of the time, whereas student groups in the low-stakes section block-voted only 45% of the time, a statistically significant difference (p<0.0005). Since there was no statistically significant difference in block-voting between the high-stakes and low-stakes sections in the fall, when audio-recorders were not used, the authors conclude that the difference observed in the spring might be due to the Hawthorne effect. That is, since the audio-recorders visually reminded students that they were being studied, the students might have altered their behavior. The authors plan to conduct a follow-up experiment designed to clarify the effect of recorders on student behavior.

The authors note that the results for the spring semester (when audio-recorders were used) are consistent with James’ results. They point out that James reported much higher rates of block-voting in both the high-stakes and low-stakes classes, but this makes sense given that (a) consensus is more difficult to achieve among four students than between pairs of students and (b) clicker scores contributed much more to students’ course grades in James’ study.

Although students in the spring semester’s high-stakes section answered clicker questions correctly more often than students in the low-stakes section (57% vs. 50%), both groups of students performed equally well in terms of course grades and performance on the Astronomy Diagnostic Test (ADT), a “reliable and validated exam on general astronomy knowledge usually taught in high school science courses.” (The ADT was also used by Len, 2007.) The authors conclude that this provides evidence that the results of clicker questions in high-stakes settings may not reflect student understanding as accurately as those in low-stakes settings, a conclusion echoed by James.

Analysis of the audio-recordings in the spring semester indicated not only that students spoke more in the low-stakes section than in the high-stakes section, but also that the nature of their conversations was different. In the low-stakes section, students more frequently stated answer preferences, asked for clarification, restated the question, and articulated new questions. The authors conclude that high-stakes grading of clicker questions “will not lead to an increase in frank discussions among the students.”

Comments: These findings provide additional evidence in favor of the claims that James made in his article: that high-stakes grading schemes inhibit balanced and open student discussion during peer instruction time and lead to clicker question results that are less accurate assessments of student understanding. As I said on Monday, these are important findings for instructors making choices about how to grade clicker questions, particularly instructors wishing to encourage useful peer instruction discussions and generate data on student learning useful for making agile teaching decisions.

As with James’ study, I’m impressed by the qualitative methods employed by Willoughby and Gustafson. Audio-recording student conversations during clicker questions, then coding those conversations to identify patterns in them, is a very useful method for understanding small-group learning dynamics. The methods used here lend a lot of weight to the authors’ conclusion that high-stakes environments negatively impact the quality of peer discussions.

The degree to which block-voting was apparently influenced by the presence of the audio-recorders in this study is surprising to me. I’m not sure why this would be the case. Perhaps students who were recorded in the low-stakes class thought that, if their small groups didn’t agree on the answer to a clicker question but voted as if they did, the audio-recording would reveal this discrepancy to their instructors, so they voted more honestly. Students in the high-stakes class might have felt a similar concern, but that concern was outweighed by their interest in scoring more points on clicker questions by answering accurately.

If that’s the case, however, it undercuts the assertion that lower-stakes grading schemes yield clicker results that more accurately reflect student understanding. On the other hand, since the clicker questions contributed only 4% to the students’ overall course grade in the semester in which audio-recorders were not used, both sections were effectively low-stakes from the students’ point of view, which would explain the lack of difference in block-voting patterns between the two sections. Recall that in James’ study, clicker questions contributed much higher percentages of the students’ course grades, magnifying the difference between high-stakes and low-stakes grading schemes.

I look forward to hearing the results of the authors’ follow-up study, the one designed to shed light on the effect of the presence of audio-recorders on student discussions and voting.