Comparable Measures

One of the holy grails of ed reform is comparability. The aim is a score or grade or rating that allows us to say definitively that Hypothetical High School is a better school than Imaginary Academy, that Pat O’Furniture teaching third grade in Iowa is a better teacher than Teachy McTeacherson teaching tenth grade Spanish in Maine.

But we’re also looking for evaluations that provide useful information, and that’s where we run into one of the major problems in the evaluation world these days.

The more comparable a measure is, the less useful it is.

Comparable measures must be reductive. In order to compare the elementary teacher in Iowa and the language teacher in Maine, we have to reduce the measure to elements that both teachers possess. This means that the measure must be simple, and it must ignore most of what makes each teacher unique.

This evaluation problem is mirrored by the challenges of student assessment in a classroom. For an example, let me talk about grading writing assignments. I do a multitude of assignment types in my classroom, but for our purposes, let’s focus on one particular type.

Many of my students’ essays are scored with a modified six traits writing rubric. The standard rubric breaks writing down into six different qualities; my modified version breaks each of the six into two or three sub-categories, for a grand total of fifteen specific characteristics of the writing. Those sub-scores provide a slightly richer assortment of data for the students and for me about where their strengths and weaknesses lie on a particular assignment. But I can’t really compare that batch of fifteen scores easily. If I want to compare and rank the “best” writers, I need to combine the scores into raw totals. But those raw totals, while easy to compare, provide little useful information. I can say that Pat “ranks” one point higher than Chris, but that doesn’t help either of them improve their writing, and the raw comparison doesn’t show that Pat has a strong voice but lousy technical control, while Chris is a good technician but cold and boring.
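For the spreadsheet-minded, here is a rough sketch of that Pat-and-Chris problem. Everything in it is made up for illustration: the scores are hypothetical, the 1-to-5 scale is an assumption, and I've used just the six top-level traits rather than my fifteen sub-categories to keep it short.

```python
# Hypothetical rubric scores on an assumed 1-5 scale (not real student data).
# Only the six top-level traits are used here, as a simplification.
pat = {
    "ideas": 4, "organization": 3, "voice": 5,
    "word_choice": 4, "sentence_fluency": 3, "conventions": 2,
}
chris = {
    "ideas": 3, "organization": 4, "voice": 2,
    "word_choice": 3, "sentence_fluency": 3, "conventions": 5,
}


def total(scores):
    """Collapse a profile into one comparable number."""
    return sum(scores.values())


def strongest_and_weakest(scores):
    """Return the (highest-scoring trait, lowest-scoring trait) pair."""
    return max(scores, key=scores.get), min(scores, key=scores.get)


# The comparable measure: one number per writer, easy to rank.
print("Pat total:", total(pat), "| Chris total:", total(chris))

# The useful measure: where each writer is strong and weak.
print("Pat:", strongest_and_weakest(pat))
print("Chris:", strongest_and_weakest(chris))
```

The totals land one point apart, which is all a ranking would ever show. The profiles are mirror images of each other, which is the part either writer could actually use.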

And the most useful feedback and evaluation for both is actually a one-on-one conference with me (hard to squeeze in, but now and then I manage), which involves discussion and give and take and reflection and plans for future approaches. These conferences are exceptionally useful, and completely non-comparable (unless, of course, we apply some reductive tool that “helps” me turn the conference into a score, but then we’ve lost everything that was useful to the student about the conference).

But wait, you may say. Doesn’t that mean that our traditional grades are also reductive and pretty unuseful to the students? And I will say, yes, you are correct, but let’s save that (more radical) discussion for another day.

Comparable measures can be useful, and do have their place when they are used in ways that acknowledge how narrow they are. Need to know which student is tallest or most consistently shows up to class on time? We can do that.

But complex human behaviors can’t be reduced to comparison-ready measures without losing most of what matters in the translation. Not only are we talking about a complicated array of many different qualities, but those qualities themselves can cut in both positive and negative ways. It is one of the oldest observations about human character– a person’s greatest strength and most terrible weakness are often two sides of a single coin. I am a pretty solid and dependable guy; I am also pretty dull and unexciting. Two sides of the same coin. If the measurement system only weighs the coins without considering how they turn, we’ve missed important information.

Teachers teach different students. They teach different material. They teach it in different ways. They bring different strengths and weaknesses to the classroom, and those in turn may be weaknesses or strengths depending on what is in that classroom. We can’t evaluate a teacher in isolation from all other factors any more than we can decide whether or not a man is a good husband if he’s not in any sort of relationship.

If our goal is to do teacher evaluations that are helpful and useful, that help teachers develop and strengthen and grow their teaching skills, tools, and talents, then we must recognize that any such instrument will not yield easily comparable results. My question to reformsters is simple– would you rather help Teachy McTeacherson do the very best teaching she can, or do you want to be able to compare her to Pat O’Furniture? Which do you think will best serve the needs of the child? Because you can’t do both at once. It’s possible (though I have to mull some more) that you can’t do both at all. A yardstick can measure consistently, clearly and accurately– but only in one dimension, and teaching never happens in just one direction.