Everyone who is anyone in the field of testing has heard of Samuel Messick. The American Psychological Association has instituted a prestigious annual scientific award in his name, honoring his important work in the theory of test validity. I want to devote this, my first-ever blog post, to one of his seminal insights about testing. It's arguable that his insight is critical for the future effectiveness of American education.

My logic goes this way: Every knowledgeable teacher and policy maker knows that tests, not standards, have the greater influence on what principals and teachers do in the classroom. My colleagues in Massachusetts—the state that has the most effective tests and standards—assure me that it’s the demanding, content-rich MCAS tests that determine what is taught in the schools. How could it be otherwise? The tests determine whether a student graduates or whether a school gets a high ranking. The standards do vaguely guide the contents of the tests, but the tests are the de facto standards.

It has been and will continue to be a lively blog topic to argue the pros and cons of the new Common Core State Standards in English Language Arts. But so far these arguments are more theological than empirical, since any number of future curricula—some good, some less so—can fulfill the requirements of the standards. I'm sure the debates over these not-yet-existent curricula will continue; so it won't be spoiling anyone's fun if I observe that these heated debates bear a resemblance to what was called in the Middle Ages the Odium Theologicum: disputation over unseen and unknown entities. Ultimately these arguments will need to be tied down to tests. Tests will decide the actual educational effects of the Common Core standards.

But Samuel Messick enunciated some key principles that will need to be heeded by everyone involved in testing if our schools are to improve in quality and equity—not only in the forty-plus states that have agreed to use the Common Core standards but also in those states that have not. In all fifty states, tests will continue to determine classroom practice and hence the future effectiveness of American education.

In this post, I'll sketch out one of Messick's insights about test validity. In a second post, I'll show how ignoring that insight has had deleterious effects in the era of NCLB. And in a third and final post on this topic, I'll suggest policy principles for honoring the scientific acumen and practical wisdom of Samuel Messick in the era of the Common Core standards.

******

Messick's most distinctive observation shook up the testing world, and still does. He said that it is not a sufficient validation of a test to show that it exhibits "construct validity." This term of art means that the test really does accurately estimate what it claims to estimate. No, said Messick, that is a purely technical criterion. Accurate estimates are not the only or even the chief function of tests in a society. In fact, accurate estimates can have unintended negative effects. In the world of work they can unfairly exclude people from jobs that they are well suited to perform. In the schools, "valid" tests may actually cause a decline in the very achievement being tested, a paradoxical outcome that I will stress in all three posts devoted to Messick.

Messick called this real-world attribute of tests “consequential validity.” He proposed that test validity be conceived as a unitary quality comprising both construct validity and consequential validity—both the technical and the ethical-social dimension. What shall it profit a test if it reaches an accurate conclusion yet injures the social goal it was trying to serve?

Many years ago I experienced the force of Messick's observation before I knew that he was the source of it. It was in the early 1980s, and I had published a book, The Philosophy of Composition, on the valid testing of student writing. At the time, Messick was the chief scientist at the Educational Testing Service, and under him a definitive study had been conducted to determine the most valid way to measure a person's writing ability. Actual scoring of writing samples was notoriously inconsistent, and hence unfair. Even when graded by specially socialized groups of readers (the current system), there was a good deal of variance in the scoring.

ETS devised a test that probed writing ability less directly and far more reliably. It consisted of a few multiple-choice items concerned with general vocabulary and editorial acumen. This test proved to be not only far shorter and cheaper but also more reliable and valid. That is, it better predicted elaborately determined expert judgment of writing ability than did the writing samples.

There was just one trouble with this newly devised test. As it came into use over time, student writing ability began to decline. The most plausible explanation was that although the test had construct validity, it lacked consequential validity. It accurately predicted writing skill, but it encouraged classroom activity that diminished writing skill—a perfect illustration of Messick's insight.

Under his intellectual influence, an actual writing sample is now, once again, to be found on the verbal SAT. The purely indirect test that dispensed with the writing sample had the unfortunate consequence of reducing the amount of writing assigned in the schools, and hence reducing the writing abilities of students. A shame, in a way: the indirect test was not just a more accurate predictor; it was also fairer, shorter, and cheaper. But ETS made the right decision in valuing consequential validity above accuracy and elegance.