Saturday, June 14, 2008

Applying Dan's assessment system, Part II - scoring

Note: A discussion of more general lessons learned while applying this assessment system is posted here. This entry is a dry, technical discussion of scoring and grade calculations, of interest only to teachers thinking of applying this system themselves.

Dan Meyer assesses students at least twice on every test item, scoring out of 4 each time. At the second round of assessment, he alters the possible points from 4 to 5. If a student scores a 4 on both rounds of assessment, she nets a 5. Otherwise, her highest score applies. Dan makes the second round of assessments a little harder than the first, so that a second 4 indicates greater skill than the first 4.
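For readers who want to see the rule spelled out, here is a minimal sketch of Dan's scoring as I have described it above. The function name `dan_score` and the (points, possible) return format are my own inventions for illustration, not anything from Dan's gradebook.

```python
def dan_score(first, second=None):
    """Return (points, possible) for one skill under the rule described above.

    Hypothetical helper: `first` and `second` are scores out of 4.
    Before the second round the skill is worth 4 points; afterwards it is
    worth 5, and only a 4 on both rounds earns the full 5.
    """
    if second is None:
        # Only one round taken so far: score stands out of 4.
        return first, 4
    if first == 4 and second == 4:
        # Perfect on both rounds nets a 5 out of 5.
        return 5, 5
    # Otherwise the higher of the two scores applies, now out of 5.
    return max(first, second), 5


print(dan_score(3))       # one round only
print(dan_score(3, 3))    # no improvement on the second round
print(dan_score(4, 4))    # perfect twice
```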

Raising the possible points from 4 to 5 means that students who do not improve from one assessment to the next automatically see their grade drop. For example, if a student scores a 3 on the first assessment and another 3 on the second, more demanding assessment of the same skill, the grade on that concept drops from 3/4 = 75% to 3/5 = 60% - from a solid C to a D-. This caused problems that almost made me abandon the system early last fall. First, students got upset when their grade dropped without their knowledge having changed. Second, having grades drop after a progress report has been issued is not actually legal - or so I was told by my Department Head. One evening shortly after the first progress reports had been printed, we went manually through all the scores in the gradebook and altered them so that the grades would be the same as before, for example by changing 3/5 to 3.75/5, since that equals 3/4 = 75%.
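The arithmetic behind that manual fix is simple enough to sketch. The helper name `preserve_percentage` is hypothetical; it just rescales a score so that the percentage stays the same when the possible points change.

```python
def preserve_percentage(score, old_possible, new_possible):
    """Rescale `score` so score/old_possible equals result/new_possible.

    Hypothetical helper illustrating the gradebook adjustment described
    above; not a feature of any gradebook software.
    """
    return score * new_possible / old_possible


# A 3 out of 4 (75%) becomes 3.75 out of 5 (still 75%).
adjusted = preserve_percentage(3, 4, 5)
print(adjusted, adjusted / 5 * 100)
```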

Another problem with this system was that a scale from 0 to 4 seemed fairly coarse-grained. Students who made a mistake significant enough not to merit a top score on the first assessment would be marked down by 25 percentage points, and if they did not improve markedly by the second assessment they would net a D-. Improvement from this D- would be possible only if they subsequently earned a perfect score. I first thought that the large number of skills and the repetition of assessments would make the total grading scale adequately continuous, that students might average a C by scoring perfectly on some skills and poorly on others. However, some students seemed, even when working hard, to be unable to ever score a 4. They'd always make some significant mistake or other, but not one serious enough to make a D- seem appropriate. Now I am sure that in the mutual adjustment of quiz difficulty and scoring practice there is some wiggle room for making this work in a fair way, and I assume Dan Meyer has figured out a balance here. However, I ended up changing my grading scale.

Solving these problems without losing important features of the original system proved pretty difficult, and I found no perfect solution. I wanted my score assignment to do what Dan's did, in particular to make it necessary for students to take every assessment twice, in order to ensure stability and retention of knowledge. Dan's practice of increasing the possible points does just that - students cannot just be satisfied with their 3/4 = 75% and decide not to attempt the second assessment of the same skill. In the end I decided not to report students' scores online until they had had both assessments. I made the two assessments of equivalent difficulty (which simplified things for me) and then grades were assigned based on students' best two scores according to the following table:

In summary: If both scores are 3 or lower, the higher score applies. If both scores are above 3, the grade is the average of the two. If one score is above a 3 and the other below, the grade is the average of 3 and the higher score. With this score assignment, students still had an incentive to demonstrate perfect mastery twice, in order to net a grade of 100.
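The summary rule can be written out as a short sketch. The function name `composite_score` is hypothetical, and I am assuming scores on the 0 to 4 scale with fractional values such as 3.5 allowed. The result is in points out of 4; how the points convert to a percentage is left out here.

```python
def composite_score(a, b):
    """Combine a student's best two scores (each out of 4) into one score.

    Hypothetical sketch of the summary rule above:
    - both scores 3 or lower: the higher score applies
    - otherwise: average the higher score with the lower one,
      but count the lower score as no less than 3
    """
    high, low = max(a, b), min(a, b)
    if high <= 3:
        return high
    # If low is above 3 this is the plain average of the two scores;
    # if low is below 3 it is the average of 3 and the higher score.
    return (high + max(low, 3)) / 2


print(composite_score(3, 2))    # both at or below 3
print(composite_score(4, 2))    # one above 3, one below
print(composite_score(4, 4))    # perfect mastery twice
```

Note that a single 4 paired with a weak score tops out at 3.5, so the full 4 still requires two perfect assessments.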

A disadvantage of this system is its clunkiness compared to Dan's simpler system. Much of the appeal of this whole approach to grading was its transparency to students, the clarity it could afford them about what to focus on. Some of this is lost with this conversion table. Also, since the best two scores count, the system appears to have somewhat more inertia; poor scores don't go away as fast as they seem to in the original system, where the better score always counts. This slower improvement is more appearance than reality, since two 4's are necessary to achieve a 100 in Dan's system too, but appearance matters in this context. The main disadvantage, however, was switching to this different scale after the first progress report, which caused some confusion and, I think, some loss of buy-in from students. They seemed a little less enthusiastic about completing their tracking sheets after that.

As an alternative, I experimented a little with just entering both of the best two scores into PowerGrade this spring, labeling the entries "Skill 14A" and "Skill 14B," for example, and assigning half weight to each. I am undecided about whether I will do this in the fall or just enter the composite grade. It is of paramount importance that the students understand the relation between the scores on the papers they get back and the scores on their grade printout, and this system would help in that regard, but it would make for a large number of gradebook entries, which means more messiness.
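To make the half-weight idea concrete, here is a rough sketch of how the two entries for a skill could be built and averaged. The function names and the tuple layout are hypothetical and have nothing to do with how PowerGrade actually stores entries. I am assuming scores out of 4.

```python
def best_two_entries(skill_label, scores):
    """Return two half-weight gradebook entries for the best two scores.

    Hypothetical helper: each entry is (label, score, weight), with labels
    such as "Skill 14A" and "Skill 14B". Requires at least two scores.
    """
    best, second = sorted(scores, reverse=True)[:2]
    return [(skill_label + "A", best, 0.5), (skill_label + "B", second, 0.5)]


def skill_percentage(entries, possible=4):
    """Weighted percentage for one skill from its half-weight entries."""
    return sum(score * weight for _, score, weight in entries) / possible * 100


entries = best_two_entries("Skill 14", [2, 4, 3])
print(entries)
print(skill_percentage(entries))
```

Unlike the composite rule, this plain average does not treat a low second score as a 3, so the two approaches can give different grades for the same pair of papers.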

Finally, a note on the scoring of any quiz item: In some cases it made sense to assign a point value to different components of the test item, and sometimes I wrote the test items to make this possible. Other times, I evaluated the complete response to the test item as a whole, and assigned scores as follows:

Frankly, for some skills that did not lend themselves well to decomposition into parts with point values for each, I'd score based on my mental image of what a D-, a B- and an A would look like. If grades are supposed to be derived from scores rather than the other way around, that introduces some circularity that one might argue about, but I don't care. I think grades as descriptors of performance levels rather than as translations of some numerical score make more sense anyway. But that is another story that would make for a separate discussion.

And since this scoring business turned out so much trickier than I'd anticipated, well-thought-out suggestions for making it clearer and fairer would be appreciated.

9 comments:

Early on my students hated the 5 point system enough that it became a 10 point system after the first quarter or so. (It meant that any error dropped you down to a 9, whether it applied directly to the skill at hand or not.) And at that time, I, like you, made the questions equally difficult.

But I didn't figure out that I wanted to keep the TWO highest grades until much later in the year. That's my plan for next year. I need to consider if I keep the 10 points, or go back to 5 points. And how to increase buy-in for either system.

I considered a 10 point scale (I think my Department Head suggested that) but eventually figured that that would increase the time and effort spent on grading by too much. A 10 point scale with a clear rubric might be more fair and make more nuanced measurement possible - however, it would make it harder to get the assessments back by the next class period, and for these piecemeal assessments I tried hard to get them back fast. I felt returning them quickly, discussing typical mistakes and clearing time for a retest was more important than getting a more carefully considered grade. Learning over measurement precision, if you like.

H, My department is going whole hog with this in our Alg 1 classes this upcoming year. I have worked out a pretty good way (we think currently) to handle the grading. I will be posting on the grading aspect soon this week, with screen shots of the software screen and the point values.

bcarrera - the time between assessments varied a lot, too much, in fact - I have no good model to share here. The use of comprehensive tests for Algebra I took up much testing time, and the kids got tired of taking assessments. In Intermediate Algebra I often included some concept tests as parts of the comprehensive tests in order to get enough repetitions done without increasing the number of testing sessions. I really wasn't happy with the frequency of assessments, and what a good schedule would look like is something we should discuss.