Binomial distributions and multiple choice tests

Readers of my blog know that I generally regard multiple choice tests (MCTs) as an adequate tool to assess student knowledge of, and proficiency with, a given set of topics. I have written about this subject here and here.

No, I do not think that MCTs are perfect, nor do I deem them necessarily the best testing methodology for every subject at every level [0] — simply that for some subjects (e.g., introductory physics) they have their place, and oftentimes offer distinct advantages over other procedures.
Obviously, much like it is possible for instructors to inflict upon their students a confusing, unfair or otherwise poorly designed problem-based test, it is equally possible to botch an MCT. There are several pitfalls of which a conscientious instructor should be aware.
The one pitfall that I would like to discuss in this post is the intrinsic element of randomness of MCTs, namely the possibility that even an unprepared (in fact, in principle utterly clueless) student will pick the right answers to a few questions just by chance. So, how much can this in practice affect the outcome of the test, by unfairly rewarding and mis-diagnosing the proficiency of some (possibly many) lucky individuals?

Case in point
Over the past few years I have taught large-enrolment (four hundred students) introductory physics courses, and made extensive use of MCTs. My typical final exam will last two hours, and consist of twenty questions, each with five answers, one being the correct one. Students have to pick one of the five. Ideally, an instructor would want a passing grade to go to a student able to pick at least a half of the right answers, i.e., scoring at least 50%. Of course, in actuality a test goes the way it goes; sometimes the instructor will fail to calibrate the test properly, and students will not perform as well as expected. In those situations, one has to worry about the possible “contaminating” effect of correct answers picked by accident.

In order to establish the above contention more quantitatively, suppose, for the sake of argument, that I gave out a completely crazy test, and that the four hundred students writing it had absolutely no clue as to what it was all about. In other words, they are unable to select one of the five answers to any of the twenty questions based on any criterion having to do with course content. Thus, they have no choice but to resort to picking randomly one of the five answers for each of the twenty questions.
In this (purely hypothetical, of course) scenario, assuming that each student makes a completely random selection on each question, one may expect the grade distribution for such a large class to be approximately binomial, with n = 20 trials and success probability p = 1/5 per question; i.e., roughly the following will be observed:

5 students will get no answer correctly, i.e., score zero

23 students will pick one correct answer

55 students will pick two correct answers

82 students will pick three correct answers

87 students will pick four correct answers

70 students will pick five correct answers

44 students will pick six correct answers

22 students will pick seven correct answers

9 students will pick eight correct answers

3 students will pick nine correct answers

(Occasionally, one of the students will be lucky enough to score exactly 50%).
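As a quick sanity check, the counts above can be reproduced with a short script; this is just a sketch using the standard library, with the class size, question count, and guessing probability taken from the scenario described in the post:

```python
from math import comb

N_STUDENTS = 400   # class size in the hypothetical scenario
N_QUESTIONS = 20   # questions on the test
P_CORRECT = 1 / 5  # chance of guessing any one question right

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Expected number of students scoring exactly k out of 20
for k in range(11):
    expected = N_STUDENTS * binomial_pmf(k, N_QUESTIONS, P_CORRECT)
    print(f"{k:2d} correct answers: ~{round(expected)} students")
```

Rounding each expected count to the nearest integer recovers the 5, 23, 55, 82, 87, … sequence listed above, and gives about 0.8 students at exactly ten correct answers, consistent with the remark that only occasionally will someone score exactly 50%.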

The class average is, of course, 20%, but, as shown above, not all students will get the same score. By sheer randomness, and randomness alone, some students will pick a relatively large number of right answers — in fact, some of them will approach 50%, conceivably a passing score. Now, suppose the instructor sees the above grade distribution and concludes that, while the class was obviously altogether unprepared for the test (which was perhaps a bit too hard), nevertheless the distribution itself shows that some students, the few who scored close to 50%, have greater knowledge than others.
Thus, the instructor decides to grade on a curve, and assigns an A to the twelve students who picked 8 or 9 correct answers, B to the 66 who picked 6 or 7, C to the 157 who scored 4 or 5, D to the 137 who got two or three answers right, and fails the rest.
Would that make any sense? An instructor doing this would essentially rank students based on their luck alone, not on their knowledge of the subject matter. But what should the instructor do, then, in a situation like this one? Fail the whole class? Perhaps, but in my opinion the most reasonable, fair course of action would be to administer a new test [1], or in any case to discard this one as fatally flawed and unreliable.

Now, clearly the above example describes an extreme case, in which every single one of the 400 students picks each and every answer randomly, but it illustrates how, if the grade distribution coming out of an MCT features a low average (say, less than 50%), telling genuine knowledge apart from sheer luck becomes an increasingly difficult proposition. This is a particularly serious problem if grading is done on a curve, and it is rendered even more serious if one considers that, in actuality, practically no student will be unable to answer any of the questions. In practice, every one of them will get at least a few right, rendering the possible contaminating effect of lucky guesses even more significant.

My personal rule of thumb is that the average should be close to 70%. The likelihood that someone knowing the answers to fewer than ten of the twenty questions will end up with fourteen correct answers by chance is slightly less than 4%. Granted, it is not perfect, but then again, no test is.
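The 4% figure can be checked directly. Taking the worst case of a student who genuinely knows nine of the twenty answers and guesses on the remaining eleven, reaching fourteen correct overall requires exactly five lucky guesses out of eleven, each with probability 1/5 (a sketch, again using only the standard library):

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Student knows 9 of 20 answers and guesses randomly on the other 11.
# Scoring exactly 14 overall means exactly 5 lucky guesses out of 11.
p_fourteen = binomial_pmf(5, 11, 1 / 5)
print(f"P(exactly 14 correct) = {p_fourteen:.4f}")  # ~0.0388
```

This comes out to about 3.9%, matching the "slightly less than 4%" quoted above.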

[1] There could be many a reason for such an appalling outcome besides the test being seriously flawed: me being a lousy teacher and/or the class being exceptionally weak are just two of them. However, that is immaterial in terms of deciding what the best course of action would be.


This entry was posted on October 24, 2012 at 6:09 am and is filed under Academia, Physics, Science, Teaching.

Most of the multiple-choice tests I took during the last couple of years of high school and first couple of years of undergrad – all of which contributed in part to a final grade – featured negative marking. i.e. you get +1 point for answering a question correctly, 0 if you don’t answer it at all, and -1 if you answer incorrectly. It was set up this way to discourage guessing. How would that affect your distribution? (you know you want to run that analysis!)
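Taking up the commenter's invitation: under a +1/0/-1 scheme, a pure guesser's number of correct answers k is still binomial, but the score becomes k - (N - k) = 2k - N, since every miss now costs a point. A sketch of that analysis (assuming, for simplicity, that the guesser answers every question rather than rationally abstaining):

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

N, P = 20, 1 / 5  # twenty questions, five choices each

# With +1 for a correct answer and -1 for a wrong one, a guesser who
# answers everything scores 2k - N, where k is binomial(N, P).
mean_score = sum((2 * k - N) * binomial_pmf(k, N, P) for k in range(N + 1))
print(f"expected score for a pure guesser: {mean_score:.1f}")  # -12.0
```

So random guessing now has an expected score of -12 out of 20, well below the zero earned by simply leaving everything blank — which is precisely why such schemes discourage guessing.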

Many of these tests were also of the kind where each question includes a list of five statements about a single phenomenon, and then the multiple choice part would be phrased something like “which of these five statements are correct? A: #1, 2 and 4; B: #1, 3 and 4; C: #2, 3 and 4; D: #3 and 5; E: #2 and 5”

Surely it depends on how wrong the wrong answers are? If you make sure the wrong answers are, say, expressed in inconsistent units, wrong by orders of magnitude, etc., then negative marking is perfectly justified, and will not be affected by simple algebraic mistakes!

Yeah, but come on now, it is an exam, people are nervous and check the wrong box by accident… plus, are we getting into the business of establishing how much “more wrong” one answer is than another? And all of this for what, for some contamination from random lucky guesses, whose effect can be rendered negligible? It’s not worth it.

Well, the farther away you can get from this randomness in the students’ solutions, the better the signal-to-noise ratio you can obtain in your measurement. I agree that the deeper you go into the random region, the more unfair the redistributed curve becomes (if you indeed normalize it somehow). This is why I always prefer to shift the grades by a constant (and not perform non-linear operations) if a correction to the distribution is needed.

The effects discussed here can also occur in non-MCT exams, where people try to guess a solution or apply a formula they happen to remember (though this effect is harder to quantify there).

I remember giving an MCT exam to a class of ~100 very weak secondary school students in a physical science class. There were 4 choices for each question, and the median grade was, I recall, about 20%. This was rather distressing, as you can imagine, and I vowed not to write such hard tests in the future.

Regardless of the poor results, I think that it’s much better to have the number of choices in the area of 6-8, for noise-cancellation purposes. You can always help the guesswork along by including 1-2 options where the units/magnitude don’t fit.

I seem to recall that optimum-difficulty tests have an average between 70 and 80%. Anything below would mean the test is too hard, anything above that it’s too easy, and either extreme comes with a poorer correlation between score and knowledge.