
Thursday, January 15, 2015

Fair Grading in a World of Curves? Concepts and an Algorithm...

One of the First World Problems of the prawfessoriat is that law school courses tend to have mandatory curves, but, of course, student performance never exactly matches those curves, and so some tweaking is required. In pursuit of improving my code-writing skills (that is, teaching myself up from barely literate to rudimentary), I've been planning to write a script that can take raw scores in numerical form and spit them back out in a form fit to (any) curve in a fair fashion. But, of course, what counts as "fair" is open to question, so I'd love to solicit your feedback on the algorithm below, and the underlying concepts, before I try to implement it in code form. Is this fair, do you think? Are there ways to make it more fair? I'm sure there's literature on this subject, but I don't know it---do any readers?

Wonky stuff below the fold:

The following assumes that grades (both raw and scaled) naturally lie on an interval scale, so that we lose information about variance between a pair of students' performances when we change the relative distances between their scores; the primary notion of fairness then becomes minimizing the amount of information that must be lost to fit the curve. The interval scale assumption also allows linear transformations to be made without loss of information. But please do question it if you don't think it's realistic.

There is also an underlying assumption that the "natural" grades given by a professor pre-curving are accurate (or at least have only unbiased error). That assumption may be false: it may be that the discipline of the curve eliminates systematic grading bias, and that the best way to grade is not to give a raw score at all, but to scale it from the start by, e.g., simply ordering work product by quality and then assigning points off the ranking to each exam. I'm definitely interested to hear if anyone believes this to be true, especially if that belief is backed by research.

A Candidate Algorithm for Fair Curve-Fitting

1. Accept input consisting of an ordered CSV or similar, with grades in raw numerical format, scale of instructor's choice, plus high and low points of raw scale. Something special may have to be done with endpoint scores at this point, especially zeroes; not sure yet.

2. Accept input defining curve by number of buckets, high and low point of each bucket, and percentage (or range thereof) of students in each bucket.

3. Apply linear transformations to reduce raw score to scale defined by curve (i.e., subtract from both as necessary to set origin at zero, divide or multiply as appropriate, then add back in for the scaled score).

4. Check to see if there are further linear transformations (here, and below, addition or subtraction will be the key) to be applied to the whole scale that make it satisfy the curve. This is a very happy outcome, permitting no loss of information, and is likely to be the case if the instructor grades roughly on the curve, but is systematically either too generous or too stingy. If so, go to #7. If not, then things get more complicated, move on.

5. Search to see if there is a point such that all scores to one side of that point fit the curve, either on their own or with linear transformations to the whole scale, while the other side fits the curve with only linear transformations to the whole thing. If so, go to #7. If not, move on.

6. Search to find the smallest number G of groups into which the set of students can be divided, such that the curve is fit with either no changes or with a linear transformation to each group. (Steps 4 and 5 are actually just special cases of step 6 with G = 1 and 2, respectively.)

7. Test all grade distributions produced by wherever the algorithm stopped to ensure that for all pairs of students i, j, if i > j in raw, then i >= j in scaled. (Minimum fairness condition.) Throw out all distributions that do not satisfy that criterion. If no such distributions remain, return to step 6, increasing minimum G by 1.

8. If there is a unique set of grades produced by wherever the algorithm stopped, apply it, spit out the grades, and go home happy. If there are multiple such sets, apply them all and report the entire set to the instructor, who chooses depending on how generous s/he wants to be.
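To make the easy parts of the list above concrete, here is a minimal Python sketch of steps 3, 4 and 7 only, not the full search in steps 5 and 6. Every name in it (rescale, fits_curve, uniform_shifts_that_fit, order_preserved) and the input format (a list of bucket floors plus a minimum and maximum share per bucket) is an assumption made for illustration, and the search is deliberately brute force.

```python
from bisect import bisect_right

def rescale(raw, raw_lo, raw_hi, new_lo, new_hi):
    """Step 3: map raw scores linearly from [raw_lo, raw_hi] onto [new_lo, new_hi]."""
    span = raw_hi - raw_lo
    return [new_lo + (s - raw_lo) * (new_hi - new_lo) / span for s in raw]

def fits_curve(scores, floors, share_ranges):
    """True if the share of scores in every bucket lies within its allowed range.
    floors: ascending lower bound of each bucket (hypothetical input format).
    share_ranges: one (min_share, max_share) pair per bucket."""
    counts = [0] * len(floors)
    for s in scores:
        # Scores below the lowest floor are counted in the lowest bucket.
        counts[max(bisect_right(floors, s) - 1, 0)] += 1
    return all(lo <= c / len(scores) <= hi for c, (lo, hi) in zip(counts, share_ranges))

def uniform_shifts_that_fit(scores, floors, share_ranges, max_steps, step=1):
    """Step 4 (G = 1): brute-force every whole-class shift k * step, |k| <= max_steps."""
    shifts = [k * step for k in range(-max_steps, max_steps + 1)]
    return [d for d in shifts if fits_curve([s + d for s in scores], floors, share_ranges)]

def order_preserved(raw, scaled):
    """Step 7: if raw[i] > raw[j], then scaled[i] >= scaled[j]."""
    n = len(raw)
    return all(scaled[i] >= scaled[j]
               for i in range(n) for j in range(n) if raw[i] > raw[j])

# Made-up example: scores already on the curve's 0-100 scale.
scores = [50, 62, 66, 71, 78, 80]
floors = [0, 70, 85]                      # three buckets: below 70, 70-84, 85 and up
share_ranges = [(0.1, 0.2), (0.45, 0.55), (0.3, 0.4)]
print(uniform_shifts_that_fit(scores, floors, share_ranges, max_steps=20))
```

With these made-up numbers, every whole-point shift from 8 through 13 satisfies the curve, which is the "multiple sets" case in step 8: all of them would be reported and the instructor would pick.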

As I am not that skilled a programmer (yet!), the script implementing this algorithm will doubtless be very inefficient when I write it, lots of brute force-ey flinging changes against a vector of scores and then testing. (Or I may just never write this and leave it as a model for more skilled programmers.) But it seems fair to me, in that its goal is to minimize the loss of information between raw score and scaled score.

One possible point of disagreement is that, as written, steps 5 and 6 fail to take into account the magnitude of transformation applied. It seems to me that for any fixed G, transformations that minimize the differences between the magnitude of addition or subtraction applied to different groups are to be preferred. (We ought to prefer a situation where the curve causes clusters of students to get small changes to their grades relative to other clusters over situations where the curve causes them to get big changes to their grades relative to other clusters.) However, while this could be taken into account holding G fixed without loss of information (and perhaps it should be, as a step 7.5), it is hard to know how it should be traded off while letting G float free. Should, for example, we prefer five groups with relatively small magnitudes of distortion between the groups, or three groups with relatively large magnitudes of distortion between them? Or is there a way to sum up the total distortion introduced by both adding groups and increasing the magnitude of changes between them? (Both are bad: the former means that more students no longer occupy the same places on an interval scale with respect to one another as they did before applying the curve; the latter means that the differences between the students who do not occupy those same places are larger.)
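For the fixed-G case, the "step 7.5" tie-break might look something like the sketch below. It assumes each surviving candidate is just a list of the shifts applied to its groups, and it reads "differences between the magnitudes" as the gap between the largest and smallest shift; that reading and both function names are assumptions, and the sketch says nothing about how to trade this off against G.

```python
def shift_spread(group_shifts):
    """Gap between the most generous and the stingiest per-group shift."""
    return max(group_shifts) - min(group_shifts)

def prefer_smallest_spread(candidates):
    """Among candidates with the same number of groups, keep the one whose
    groups are moved most nearly in lockstep."""
    return min(candidates, key=shift_spread)

# Two made-up candidates, each with G = 3 groups.
candidates = [[0.0, 4.0, -3.0], [1.0, 2.0, 0.0]]
print(prefer_smallest_spread(candidates))
```

Here the second candidate is kept, because its groups move by at most 2 points relative to one another rather than 7.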

There's probably some highly math-ey way to do all of this, perhaps imported from psychometricians (who do this sort of stuff with test scores all the time), but (apart from the fact that I don't know much psychometrics) the other constraint, it seems to me, is that any method ought not to be a total black box: a non-mathey law professor ought to be able to understand what's going on with the algorithm s/he uses to calculate grades. Perhaps that is misguided?

Thanks DMP---that's an interesting approach. I don't know k-means clustering, but based on a quick wikipedia skim, it looks like a plausible approach---worth playing around with this code on some real grades.

(Although it pains me a little to hear R described as easy to import/export CSVs from. Every single time I've ever done stats, the hardest thing has been to get my data into R, regardless of original format. One time I even killed a whole project because I simply could not get someone's SPSS data into a form I could work with...)

It looks like there are different problems based on how the school sets up the curve. I had a go at the problem where you have a mean grade and a fixed number of buckets. A reasonable approach is to use k-means clustering to group the students into buckets, and then assign the buckets various grades until you've come up with a combination that gets you closest to the desired mean (i.e. one combo goes A, A-, B, B-, C, another goes A+, B, B-, C+,C, etc.). See the below implementation in the R language (which is very easy to import/export csvs from).

# Assumes these objects are already defined (none are created in this snippet):
#   samplegrades, numpartitions, matrixofgradecombos, ranktogrademap, meangrade

# partition the data set into the number of buckets
x <- kmeans(samplegrades, numpartitions)
y <- rank(x$centers)[x$cluster]  # force the groupings into rank order

# loop through the combinations of grades to get the avg grade for each combo
vectorofaverages <- numeric(ncol(matrixofgradecombos))  # allocate the vector
for (i in seq_along(vectorofaverages)) {  # get mean grade of each combination
  vectorofaverages[i] <- mean(ranktogrademap[matrixofgradecombos[, i][y]])
}

# identify the combo that gets you closest to the desired mean
gradecomboindex <- which.min(abs(vectorofaverages - meangrade))

Joey - I'd hate to be the student with the F- so the rest of the GPA works out...

To answer your question, different schools do it different ways. My school does the bucket method throughout, but I know others that do it as you describe. The benefit of the bucket method, especially in the first year, is that bimodal graders don't dominate who gets on to law review.

This is perhaps an obvious question, but do most law schools really specify "curves"?

Mine just specifies a required mean grade -- which faculty and students refer to colloquially as "the curve" -- but this leaves a lot of flexibility. If the grade distribution looks like a sharp pointy curve around that point, a bimodal distribution, a flat line, whatever, it's fine as long as the grades average out to the required mean. (For first-year courses it's a little more prescriptive -- the school has some requirements about the % of A's, B's, etc., which is closer to requiring a "curve".)

I like the way my school does it (for non-first-year courses) because let's face it, sometimes a class really does have a bimodal or otherwise other-than-normal distribution and it would be somewhat odd, creating arbitrary winners and losers, to require a fit to a pre-determined curve.

Posted by: Joey | Jan 17, 2015 12:02:45 PM

I would recommend you read and consult "Effective Grading" 2d ed. by Barbara Walvoord & Virginia Anderson. Really a comprehensive guide to grading and evaluating students.

Posted by: Eugene Pekar | Jan 16, 2015 12:49:53 PM

Not directly on point, but if I were putting together a tool to be distributed, I would not try to support arbitrary CSV files. It's an ill-specified file format.

Posted by: brad | Jan 16, 2015 12:11:02 PM

What James said. This is how I do it, albeit by hand. I have a spreadsheet with the percentile sizes for each cut line and it tallies the score needed to hit that cutline. Because there are ties, that rarely works out evenly (though in my experience it is always awfully close). From there, I can manually adjust the score breaks to match the cut lines as closely as possible, or decide whether I'm going to deviate because I just don't think the difference at that cutline is the difference between an A and a B. If you want an example, let me know.
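The spreadsheet tally described above could be sketched roughly as follows. This is only a guess at the mechanics, not the commenter's actual spreadsheet: the function name cut_scores and the idea of giving each cut line as a cumulative share of the class counted from the top are assumptions.

```python
def cut_scores(scores, top_shares):
    """For each cut line, given as a cumulative share of the class counted
    from the top, return (cut_score, actual_share).
    cut_score is the lowest raw score the cut line admits; actual_share counts
    everyone at or above it, so ties can push it past the target."""
    ranked = sorted(scores, reverse=True)
    n = len(ranked)
    result = []
    for share in top_shares:
        k = max(1, round(share * n))          # students the cut line aims to admit
        cut = ranked[k - 1]
        actual = sum(1 for s in ranked if s >= cut) / n
        result.append((cut, actual))
    return result

# Made-up class of ten; cut lines at the top 20% and the top 40%.
scores = [95, 91, 88, 88, 88, 84, 80, 77, 70, 65]
print(cut_scores(scores, [0.2, 0.4]))
```

In this made-up class the 20% line lands cleanly at 91, but the 40% line falls on a three-way tie at 88 and so admits half the class. That is the point at which the manual adjustment (or a decision to deviate) comes in.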

As for your transformations, it's been a while since I did hard core math, but my experience is that you want to keep most transformations exponential (log based) or multiplicative (percentage of top score), etc., to bring numbers closer or farther. I don't think adding and subtracting is a great transformation because it doesn't actually change the shape of the curve. In that sense, even multiplication doesn't do much.

While it's not been popular in law schools, traditionally the way to minimize error in grading has been to increase the number of things graded, so that any particular error is less likely to make a significant difference in the final grade. Unfortunately, law courses—especially first year courses—have a long tradition of being graded solely on the basis of a single exam. Adding a midterm, a short brief, or some other assignment can help to minimize these errors. Plus, you have the added benefits that come from having more graded assignments: you can test on a greater variety of issues/skills, students are more likely to learn as they go along rather than cram at the end, etc.

Posted by: Charles Paul Hoffman | Jan 15, 2015 4:50:00 PM

Aah, I see, yes, James, excellent point. Any thoughts on how one might figure out the algorithm to carry that out?

Paul, my formulation of the problem automatically preserves the ordering of students. That's why I posed the optimization problem as a search for the best cutlines. If the cutline between A- and B+ is at 59, then any student who got a 59 or above gets at least an A- and any student who got a 58 or below gets at most a B+. Students with different raw scores might get the same grade, but they will never get reversed grades. Someone who implemented an optimization algorithm as I describe wouldn't need a step corresponding to your #7.
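A small sketch of why cutlines preserve ordering, using the 59/58 example above. The function name, the second cutline at 45, and the letter grades other than A- and B+ are made up for illustration.

```python
from bisect import bisect_right

def grade_by_cutlines(score, cutlines, grades):
    """cutlines: ascending raw scores; a score at or above a cutline clears it.
    grades: one more entry than cutlines, lowest grade first."""
    return grades[bisect_right(cutlines, score)]

cutlines = [45, 59]                 # 59 is the A-/B+ line; 45 is hypothetical
grades = ["B", "B+", "A-"]
for score in (60, 59, 58, 44):
    print(score, grade_by_cutlines(score, cutlines, grades))
```

Because the grade depends only on how many cutlines a score clears, a higher raw score can never map to a lower grade, though two different raw scores can share one.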

Phil - a big problem (and the main justification as far as I can see for curves) is that absent a curve, different professors will end up with different means for their grades. This is highly problematic as there is no a priori reason to believe that the students assigned to Professor A are markedly different from the students assigned to Professor B. Indeed, assuming that (like at my school) curves are used mostly for the first year courses and students are randomly assigned to sections in their first year, your assumption would probably be the opposite - that the average ability of the students in Professor A and Professor B's classes ought to be the same. If this is the case, then having different means for Professors A and B means that students in one professor's class are getting better/worse grades than they would for exactly the same work in the other class. That's bad. Thus the curves are designed to ensure that a person who makes an A in Professor A's class would also make an A in Professor B's class. There is no "grade they earned" in an absolute sense. Rather, we try to make sure that they are getting a fair grade relative to other students.

Posted by: anon | Jan 15, 2015 3:56:30 PM

Thanks Former Editor and anon (both very helpful). Phil, the problem is that law schools often have rules requiring a curve, which cannot be changed by an individual faculty member. Now, your answer might be "then show up at the next faculty meeting demanding a change," but that's a much longer conversation. (In the kind of employment world students face, where grades really matter, it makes sense from the standpoint of fairness to ensure that every class is on a consistent curve.) In the world we're in, grades must be fit to curves.

I spotted your problem: "student performance never exactly matches those curves, and so some tweaking is required"

Tweaking is required to do what, exactly? Support the idea that your students are a statistically representative subset of what you think the general population looks like? The notion is preposterous. There is probably an error in your conception of the general population and your students are not a random sample.

Here's your solution: Develop an appropriate grading rubric and give the students the grade they earned. Stop.

Posted by: Phil | Jan 15, 2015 1:53:55 PM

I think another problem might be that you assume your raw scores are errorless (at least I don't see any treatment of error in your algorithm). There is likely error in your raw scores that may come from a number of different sources, particularly if you use essay tests rather than multiple choice. Even if you use multiple choice questions, it is unlikely that the raw scores you have perfectly match whatever we might think of as the students' understanding of the course material. Thus, if given a situation where you have two students who have different raw scores but whose error ranges overlap, I don't think you can be sure that the student with the higher raw score really is a better student than the student with the lower raw score. For this reason, I try to make sure that students whose grades are quite close receive the same grade. I believe there is a way to estimate the errors in the scores, although I don't formally do this when computing my own grades. Nevertheless, if I were going to write an algorithm for this, I would probably want to calculate error ranges and try (as much as possible) to make sure that students whose error ranges overlap receive the same grade.
P.S. I am not formally trained in this stuff, so I could well be barking up the wrong tree :)
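One rough way to act on the error-range idea in this comment is sketched below. It assumes a single error margin that applies to every score (the comment does not say how the margin would be estimated), and the function name overlap_groups is hypothetical. Neighbouring scores whose ranges overlap are chained into one group, and cut lines would then be placed only in the gaps between groups.

```python
def overlap_groups(scores, margin):
    """Group scores, highest first, so that neighbours whose error ranges
    (score - margin, score + margin) overlap or touch end up together."""
    ordered = sorted(scores, reverse=True)
    if not ordered:
        return []
    groups = [[ordered[0]]]
    for s in ordered[1:]:
        # Ranges overlap or touch when the scores are at most 2 * margin apart.
        if groups[-1][-1] - s <= 2 * margin:
            groups[-1].append(s)
        else:
            groups.append([s])
    return groups

print(overlap_groups([90, 88, 87, 80, 79, 70], margin=1))
```

One caveat of this particular sketch: overlap is not transitive, so a long chain of closely spaced scores can end up in one group even when its top and bottom scores are far apart.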

Posted by: anon | Jan 15, 2015 1:47:37 PM

Hmm... I like the correlation approach, but wonder if it would inappropriately allow for extreme changes in individual cases. For example, what if it turns out that the raw scores could be perfectly fit to the curve just by taking the highest-scoring student and moving her/him to the bottom? It seems like that would likely count as a v. high correlation between raw and scaled scores, but really screw the one student....

This strikes me as an ad hoc way of tackling the problem. "Minimum number of piecewise linear transformations" is not a metric that tracks the goals of curving. It prioritizes simplicity over fairness. Given your view that a curving algorithm should minimize the loss of "information" in the students' raw scores, a better metric of algorithmic quality would measure the distortion introduced by fitting raw grades to the curve. The greater the correlation between the raw scores and the curved grades, plotted against each other on their respective intervals, the better the curve.

If that's the case, then this is an optimization problem: select the N-1 cutlines between the N grading buckets in a way that maximizes a measure of correlation between raw scores and curved grades. You could treat the size of the curve-mandated buckets as a constraint on the optimization, or you could create another metric to evaluate the degree to which an assignment deviates from ideal bucket sizes. If the latter, then you need a function that weights the two metrics -- distortion of raw scores and fidelity to ideal distribution -- to produce an overall quality metric. Then you look for an algorithm that optimizes this overall quality metric in a computationally feasible way. (If necessary, go back to earlier stages and select different metrics to make the optimization algorithm more computationally tractable.)
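A brute-force sketch of the weighted version of this optimization follows. The specific choices are assumptions, not something the comment specifies: Pearson correlation as the distortion measure, the sum of absolute differences between actual and target bucket shares as the fidelity measure, and a single weight combining them. The names best_cutlines, grade_points and target_shares are hypothetical, and trying every combination of cutlines is only feasible for small classes and few buckets.

```python
from itertools import combinations
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation; assumes neither list is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)

def best_cutlines(scores, grade_points, target_shares, weight=1.0):
    """Try every set of N-1 cutlines placed at distinct raw scores.
    grade_points (distinct values) and target_shares are listed lowest bucket
    first. A score at or above a cutline clears it. Returns the cutlines with
    the highest quality = correlation - weight * share deviation, or None if
    there are too few distinct scores to fill every bucket."""
    distinct = sorted(set(scores))
    n = len(scores)
    best = None
    for cuts in combinations(distinct[1:], len(grade_points) - 1):
        curved = [grade_points[sum(s >= c for c in cuts)] for s in scores]
        shares = [curved.count(g) / n for g in grade_points]
        deviation = sum(abs(a - t) for a, t in zip(shares, target_shares))
        quality = pearson(scores, curved) - weight * deviation
        if best is None or quality > best[0]:
            best = (quality, cuts)
    return None if best is None else best[1]

# Made-up class of eight, three buckets worth 2.0, 3.0 and 4.0 grade points.
scores = [95, 93, 90, 72, 70, 68, 45, 40]
print(best_cutlines(scores, [2.0, 3.0, 4.0], [0.25, 0.375, 0.375]))
```

For this made-up class the search settles on cutlines at 68 and 90, which both matches the target shares exactly and follows the natural gaps in the raw scores. Setting weight to 0 ignores the curve entirely, while a very large weight approximates treating the bucket sizes as a hard constraint.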