In my last post, I tossed out a not-quite-baked idea for a new academic major for elementary teachers: World Studies. That major would ensure teachers have the broad range of knowledge they need to introduce our students to the world via literature, history, geography, science, mathematics, and the arts.

One point that is obvious to the Core Knowledge community—yet somehow shocking and mysterious to most of the education world—is that before we could decide what academics future teachers need to study in their prep programs, we need to decide what all elementary students must learn. Drum roll: We need a core K-5 curriculum.

Refusing to identify and teach essential knowledge has consequences.

Some people might think we kinda sorta have a core curriculum with the Common Core standards. In math, I would agree—the actual math that has to be mastered each year is specified. It’s far from a full-blown curriculum, but it does provide concrete guidance on what math to teach. However, the Common Core ELA standards are nearly content free. They indicate the reading, writing, listening, and speaking skills students need to develop, but they do not outline content to be taught grade by grade. Instead, the ELA standards call on schools to create content-rich curricula infused with nonfiction texts, thereby systematically building broad knowledge across academic subjects.

I am grateful that the Common Core ELA standards explain the benefits of building knowledge. In implementation, schools need to realize that when they forge ahead without any shared core of content for each grade, they miss out on the many benefits of a coherent educational system. I’ve written about the problems with student mobility; today I want to share a terrific, little-known article by education professor David Cohen. In “Learning to Teach Nothing in Particular,” he explains the massive leaps forward we could make in student and teacher evaluation if only we specified what students are supposed to know:

Because local control and weak government were the foundations of U.S. public education, most of our school systems never developed the common instruments that are found in many national school systems…. These include a common curriculum or curriculum frameworks, common examinations tied to the curriculum, teacher education grounded in learning to teach the curriculum that students are to learn, and a teaching force whose members succeeded in those curriculum-based exams as students, among other things. Teachers who work with such infrastructure have instruments that they can use to set academic tasks tied to curriculum and assessment. They have a common vocabulary with which they can work together to identify, investigate, discuss, and solve problems of teaching and learning. Hence, they can have professional knowledge and skill, held in common….

Because there is no common infrastructure for U.S. public education, it has developed several anomalous features. One of the most important concerns testing: because there is no common curriculum, it is impossible to devise tests that assess the extent of students’ mastery of that curriculum. So, even though we’ve been testing student learning for nearly 100 years, only isolated programs (such as Advanced Placement and International Baccalaureate) have tested whether students learned what they were supposed to have been taught. In the early 1900s, when E. L. Thorndike and his colleagues and students invented tests of students’ academic performance, they devised tests that were designed to be independent of any particular curriculum. Nonetheless, those tests, and more recently developed similar tests, were and are used to assess students’ progress in learning. That has to rank as one of the strangest creations in the history of education.

Teacher education is a second anomaly: absent a common curriculum, teachers-in-training could not learn how to teach it, let alone how to teach it well. Hence, teacher education consists of efforts to teach future teachers to teach no particular curriculum. This is very strange, since to teach is always to teach something, but the governance structure of U.S. education has long forbidden the specification of what that something would be. For the most part, teacher education has been accommodating: typically, teacher candidates are taught how to teach no particular version of their subjects. That arrangement creates no incentives for those training to be teachers to learn, relatively deeply, what they would teach, nor does it create incentives for teacher educators to learn how to help teacher candidates learn how to teach a particular curriculum well. Instead, it offers incentives for them to teach novices whatever the teacher educators think is interesting or important (which often is not related to what happens in schools) or to offer a generic sort of teacher education. Most teachers report that, after receiving a teaching degree, they arrived in schools with little or no capability to teach particular subjects….

Absent a common curriculum, common assessments, common measures of performance, and teacher education tied to these things, it will be terribly difficult to devise technically valid and educationally usable means to judge and act on teaching performance. Building a coherent educational system would be a large task, but not nearly as daunting as trying to solve our educational problems without building such a system. Without standards and measures of quality practice—grounded in linked curriculum, assessments, and teacher education—it will be impossible to build a knowledgeable occupation of teaching, and a knowledgeable occupation is the only durable solution to the problem of quality in teaching.

So far this week E. D. Hirsch has taught us that higher-order thinking depends on knowledge, that highly mobile students suffer acutely from our national refusal to establish a core of common content, and that there is an identifiable body of specific knowledge that facilitates communication. Now, on Hirsch’s birthday, we examine his game-changing policy prescription: curriculum-based reading tests.

Reading tests are attacked for cultural bias and other faults, but such complaints are unfounded. The tests are fast and accurate indexes of real-world reading ability. They correlate extremely well with one another and with actual capacity to learn and communicate. They consist, after all, of written passages, which students are to read and then answer questions on; that is, students are asked to exercise the very skill at issue…. The much more reasonable complaint is that an emphasis on testing has caused schools to devote too much time to drills and test preparation, with a consequent narrowing of the curriculum….

Yet the fault lies not with the tests but with the school administrators who have been persuaded that it is possible to drill for a reading test—on the mistaken assumption that reading is a skill like typing and that once you know the right techniques you can read any text addressed to a general audience. The bulk of time in early language-arts programs today is spent practicing these abstract strategies on an incoherent array of uninformative fictions. The opportunity costs have been enormous. Schools are wasting hours upon hours practicing drills that are supposed to improve reading but that are actually depriving students of knowledge that could enhance their reading comprehension….

Here is the beginning of an actual passage from a New York State reading test for fourth grade:

There is a path that starts in Maine and ends in Georgia, 2,167 miles later. This path is called the Appalachian Trail. If you want, you can walk the whole way, although only some people who try to do this actually make it, because it is so far, and they get tired. The idea for the trail came from a man named Benton MacKaye. In 1921 he wrote an article about how people needed a nearby place where they could enjoy nature and take a break from work. He thought the Appalachian Mountains would be perfect for this.

The passage goes on for a while, and then come the questions. The first question, as usual, concerns the main idea:

This article is mostly about

1. how the Appalachian Trail came to exist.

2. when people can visit the Appalachian Trail.

3. who hikes the most on the Appalachian Trail.

4. why people work together on the Appalachian Trail.

Many educators see this question as probing the general skill of “finding the main idea.” It does not. Try to put yourself in the position of a disadvantaged fourth grader who knows nothing of hiking, does not know the difference between an Appalachian-type mountain and a Himalayan-type mountain, does not know where Maine and Georgia are, and does not grasp what it means to “enjoy nature.” Such a child, though much trained in comprehension strategies, might answer the question incorrectly. The student’s more advantaged counterpart, not innately smarter, just happens to be familiar with hiking in the Appalachians, has been to Maine and Georgia, and has had a lot of experience “enjoying nature.” The second student easily answers the various questions correctly. But not because he or she practiced comprehension strategies; this student has the background knowledge to comprehend what the passage is saying….

It has been shown decisively that subject-matter knowledge trumps formal skill in reading and that proficiency in one reading-comprehension task does not necessarily predict skill in another. Test makers implicitly acknowledge this by offering, in a typical reading test, as many as ten passages on varied topics. (If reading were a knowledge-independent skill, a single passage would suffice.)… Contrary to appearances and educators’ beliefs, these reading tests do not test comprehension strategies. There usually are questions like “What is the main idea of this passage?” but such a question probes ad hoc comprehension, not some general technique of finding the main idea. Reading comprehension is not a universal, repeatable skill like sounding out words or throwing a ball through a hoop. “Reading skill” is rather an overgeneralized abstraction that obscures what reading really is: an array of separate, content-constituted skills such as the ability to read about the Appalachian Mountains or the ability to read about the Civil War….

A reading test is inherently a knowledge test. Scoring well requires familiarity with the subjects of the test passages. Hence the tests are unfair to students who, through no fault of their own, have little general knowledge. Their homes have not provided it, and neither have the schools. This difference in knowledge, not any difference in ability, is the fundamental reason for the reading gap between white and minority students. We go to school for many years partly because it takes so long to build up the vast general knowledge and vocabulary we need to become mature readers.

Because this knowledge-gaining process is slow and cumulative, the type of general reading test now in use could be fair to all groups only above fifth or sixth grade, and only after good, coherent, content-based schooling in the previous grades. I therefore propose a policy change that would at one stroke raise reading scores and narrow the fairness gap. (As a side benefit, it would induce elementary schools to impart the general knowledge children need.) Let us institute curriculum-based reading tests in first, second, third, and fourth grades—that is to say, reading tests containing passages based on knowledge that children will have received directly from their schooling. In the early grades, when children are still gaining this knowledge slowly and in piecemeal fashion, it is impossible to give a fair test of any other sort….

We now have an answer to our question of how to enable all children to ace a reading test. We need to impart systematically—starting in the very earliest grades by reading aloud to students, then later in sequenced self-reading—the general knowledge that is taken for granted in writing addressed to a broad audience. If reading tests in early grades are based on a universe of pre-announced topics, general knowledge will assuredly be built up. By later grades, when the reading tests become the standard non-curriculum ones, such as the NAEP tests, reading prowess will have risen dramatically.

Policy makers say they want to raise reading scores and narrow the fairness gap. But it seems doubtful that any state can now resist the anti-curriculum outcry that would result from actually putting curriculum-based testing into effect. Nonetheless, any state or district that courageously instituted knowledge- and curriculum-based early reading tests would see a very significant rise in students’ reading scores in later grades.

States would also see impressive results right away on the curriculum-based tests since the passages would be about content that all students had actually been taught. Just imagine: With curriculum-based tests, “test prep” would consist of studying literature, history, science, and the arts. Bringing that imaginary world to life relies on our leaders working together. So, this birthday retrospective ends with a call to the left and right, drawn from pages 186–187 of The Making of Americans.

One of the gravest disappointments I have felt in the twenty-five years that I have been actively engaged in educational reform is the frustration of being warmly welcomed by conservatives but shunned by fellow liberals. The connection of the anti-curriculum movement with the Democratic Party is an accident of history, not a logical necessity. All the logic runs the other way. A dominant liberal aim is social justice, and a definite core curriculum in early grades is necessary to achieve it. Why should conservatives alone favor solid content while my fellow liberals buy into the rhetoric of the anti-curriculum theology that works against the liberal aims of community and equality? Practical improvement of our public education will require intellectual clarity and a depolarization of the issue. Left and right must get together on the principle of common content.

* For the endnotes, please refer to the book.

Do you have a birthday message for E. D. Hirsch or favorite quote from him? Please share it with all of us in the comments.


The chief practical impact of NCLB has been its principle of accountability. Adequate yearly progress, the law stated, must be determined by test scores in reading and math—not just for the school as a whole, but for key groups of students.

Now, a decade later, the result of the law, as many have complained, has been a narrowing of the school curriculum. In far too many schools, the arts and humanities, and even science and civics, have been neglected—sacrificed on the altar of tests without any substantial progress nationwide on the tests themselves. It is hard to decide whether to call NCLB a disaster or a catastrophe.

But I disagree with those who blame this failure on the accountability principle of NCLB. The law did not specify what tests in reading and math the schools were to use. If the states had responded with valid tests—defined by Messick as tests that both measure accurately and have a productive effect on practice—the past decade would have seen much more progress.

Since NCLB, NAEP’s long-term trend assessment shows substantial increases in reading among the lowest-performing 9-year-olds—but nothing comparable in later grades. It also shows moderate increases in math among 9- and 13-year-olds.

So, it seems that a chief educational defect of the NCLB era lay in the later-grades reading tests; they simply do not have the same educational validity as the tests in early-grades reading and in early- and middle-grades math.

****

It’s not very hard to make a verbal test that predicts how well a person will be able to read. One accurate method used by the military is the two-part verbal section of the multiple-choice Armed Forces Qualification Test (AFQT), which is known for its success in accurately predicting real-world competence. One section of the AFQT Verbal consists of 15 items based on short paragraphs on different subjects and in different styles to be completed in 13 minutes. The other section of the AFQT Verbal is a vocabulary test with 35 items to be completed in 11 minutes. This 24-minute test predicts as well as any verbal test the range of your verbal abilities, your probable job competence and your future income level. It is a short, cheap and technically valid test. Some version of it could even serve as a school-leaving test.

Educators would certainly protest if that were done—if only because such a test would give very little guidance for classroom practice or curriculum. And this is the nub of the defects in the reading tests used during the era of NCLB: They did not adequately support curriculum and classroom practice. The tests in early-grades reading and in early- and middle-grades math did a better job of inducing productive classroom practice, and their results show it.

Early-grades reading tests, as Joseph Torgesen and his colleagues showed, probe chiefly phonics and fluency, not comprehension. Schools are now aware that students will be tested on phonics and fluency in early grades. In fact, these crucial early reading skills are among the few topics for which recent (pre-Common Core) state standards had begun to be highly specific. These more successful early reading tests were thus different from later ones in a critical respect: They actually tested what students were supposed to be taught.

Hence in early reading, to its credit, NCLB induced a much greater correlation than before between standards, curriculum, teaching and tests. The tests became more valid in practice because they induced teachers to teach to a test based on a highly specific subject matter—phonics and fluency. Educators and policymakers recognized that teaching swift decoding was essential in the early grades, tests assessed swift decoding, and—mirabile dictu—there was an uptick in scores on those tests.

Since the improvements were impressive, let’s take a look at what has happened over the past decade among the lowest-performing 9-year-olds on NAEP’s long-term trend assessment in reading.

Note that there is little to no growth among higher-performing 9-year-olds, presumably because they had already mastered phonics and fluency.

Similarly, early- and middle-grades math tests probed substantive grade-by-grade math knowledge, as the state standards had become ever more specific in math. You can see where I’m going: Early reading and math improved because teachers typically teach to the tests (especially under NCLB-type accountability pressures), and the subject matter of these tests began to be more and more defined and predictable, causing a collaboration and reinforcement between tests and classroom practice.

In later-grades reading tests, where we have failed to improve, the tests have not been based on any clear, specific subject matter, so it has been impossible to teach to the tests in a productive way. (The lack of alignment between math course taking and the NAEP math assessment for 17-year-olds is similarly problematic.) Of course, there are many reasons why achievement might not rise. But specific subject matter, both taught and tested, is a necessary—if not sufficient—condition for test scores to rise.

In the absence of any specific subject matter for language arts, teachers, textbook makers, and test makers have conceived of reading comprehension as a strategy rather than as a side effect of broad knowledge. This inadequate strategy approach to language arts is reflected in the tests themselves. I have read many of them. An inevitable question is something like this: “The main idea of this passage is….” And the theory behind such a question is that what is being tested is the ability of the student to strategize the meaning by “questioning the author” and performing other puzzle-solving techniques to get the right answer. But, as readers of this blog know, that is not what is being tested. The subject matter of the passage is.

This mistaken strategy-focused structure has made these tests not only valueless educationally, but worse—positively harmful. Such tests send out the misleading message that reading comprehension is chiefly strategizing. That idea has dominated language arts instruction in the past decade, which means that a great deal of time has been misspent on fruitless test-taking activities. Tragically, that time could have been spent on science, humanities and the arts—subjects that would have actually increased reading abilities (and been far more interesting).

The only way that later-grades reading tests can be made educationally valid is by adopting the more successful structure followed in early reading and math. An educationally valid test must be based on the specific substance that is taught at the grade level being tested (possibly with some sampling of specifics from previous and later grades for remediation and acceleration purposes). Testing what has been taught is the only way to foster collaboration and reinforcement between tests and classroom practice. An educationally valid reading test requires a specific curriculum—a subject of further conversations, no doubt.

In a prior post I described Messick’s unified theory of test validity, which judged a test not to be valid if its practical effects were null or deleterious. His epoch-making insight was that the validity of a test must be judged both internally for accuracy and externally for ethical and social effects. That combined judgment, he argued, is the only proper and adequate way of grading a test.

In the era of the No Child Left Behind law (2001), the looming specter of tests has been the chief determiner of classroom practice. This led me to the following chain of inferences: Since 2001, tests have been the chief determiners of educational practices. But these tests have failed to induce practices that have worked. Hence, according to the Messick principle, the tests that we have been using must not be valid. Might it be that a new, more Messick-infused approach to testing would yield far better results?

First, some details about the failure of NCLB. Despite its name and admirable impulses, it has continued to leave many children behind:

NCLB has also failed to raise verbal scores. The average verbal level of school leavers stood at 288 when the law went into effect, dropped to 283 in 2004, and stood at 286 in 2008.

Yet this graph shows an interesting exception to this pattern of failure, and it will prove to be highly informative under Messick’s principle. Among 4th graders (age 9), the test regimen of NCLB did have a positive impact.

Moreover, NCLB also had positive effects in math:

This contrast between the NCLB effects in math and reading is even more striking if we look at the SAT, where the test takers are trying their best:

So let’s recap the argument. Under NCLB, testing in both math and reading has guided school practices. Those practices were more successful in math and in early reading than in later reading. According to the Messick principle, therefore, reading tests after grade 4 had deleterious effects and cannot have been valid tests. How can we make these reading tests more valid?

A good answer to that question will help determine the future progress of American education. Tune in.

Everyone who is anyone in the field of testing has heard of Samuel Messick. The American Psychological Association has instituted a prestigious annual scientific award in his name, honoring his important work in the theory of test validity. I want to devote this, my first-ever blog post, to one of his seminal insights about testing. It’s arguable that his insight is critical for the future effectiveness of American education.

My logic goes this way: Every knowledgeable teacher and policy maker knows that tests, not standards, have the greater influence on what principals and teachers do in the classroom. My colleagues in Massachusetts—the state that has the most effective tests and standards—assure me that it’s the demanding, content-rich MCAS tests that determine what is taught in the schools. How could it be otherwise? The tests determine whether a student graduates or whether a school gets a high ranking. The standards do vaguely guide the contents of the tests, but the tests are the de facto standards.

It has been and will continue to be a lively blog topic to argue the pros and cons of the new Common Core State Standards in English Language Arts. But so far these arguments are more theological than empirical, since any number of future curricula—some good, some less so—can fulfill the requirements of the standards. I’m sure the debates over these not-yet-existent curricula will continue; so it won’t be spoiling anyone’s fun if I observe that these heated debates bear a resemblance to what was called in the Middle Ages the Odium Theologicum over unseen and unknown entities. Ultimately these arguments will need to get tied down to tests. Tests will decide the actual educational effects of the Common Core Standards.

But Samuel Messick has enunciated some key principles that will need to be heeded by everyone involved in testing if our schools are to improve in quality and equity—not only in the forty-plus states that have agreed to use the Common Core standards, but also in those states that have not. In all fifty states, tests will continue to determine classroom practice and hence the future effectiveness of American education.

In this post, I’ll sketch out one of Messick’s insights about test validity. In a second post, I’ll show how ignoring those insights has had deleterious effects in the era of NCLB. And in a third, and last on this topic, I’ll suggest policy principles to avoid ignoring the scientific acumen and practical wisdom of Samuel Messick in the era of the Common Core Standards.

******

Messick’s most distinctive observation shook up the testing world, and still does. He said that it was not a sufficient validation of a test to show that it exhibits “construct validity.” This term of art means that the test really does accurately estimate what it claims to estimate. No, said Messick, that is a purely technical criterion. Accurate estimates are not the only or chief function of tests in a society. In fact, accurate estimates can have unintended negative effects. In the world of work they can unfairly exclude people from jobs that they are well suited to perform. In the schools, “valid” tests may actually cause a decline in the achievement being tested—a paradoxical outcome that I will stress in the three blogs devoted to Messick.

Messick called this real-world attribute of tests “consequential validity.” He proposed that test validity be conceived as a unitary quality comprising both construct validity and consequential validity—both the technical and the ethical-social dimension. What shall it profit a test if it reaches an accurate conclusion yet injures the social goal it was trying to serve?

Many years ago I experienced the force of Messick’s observation before I knew that he was the source of it. It was in the early 1980s, and I had published a book on the valid testing of student writing (The Philosophy of Composition). At the time, Messick was the chief scientist at the Educational Testing Service, and under him a definitive study had been conducted to determine the most valid way to measure a person’s writing ability. Actual scoring of writing samples was notoriously inconsistent, and hence unfair. Even when graded by specially socialized groups of readers (the current system), there was a good deal of variance in the scoring.

ETS devised a test that probed writing ability less directly and far more reliably. It consisted of a few multiple-choice items concerned with general vocabulary and editorial acumen. This test proved to be not only far shorter and cheaper but also more reliable and valid. That is, it better predicted elaborately determined expert judgment of writing ability than did the writing samples.

There was just one trouble with this newly devised test. As it was used over time, student writing ability began to decline. The most plausible explanation was that although the test had construct validity it lacked consequential validity. It accurately predicted writing skill, but it encouraged classroom activity that diminished writing skill—a perfect illustration of Messick’s insight.

Under his intellectual influence there is now, again, an actual writing sample to be found on the verbal SAT. The purely indirect test which dispensed with that writing sample had had the unfortunate consequence of reducing the amount of student writing assigned in the schools, and hence reducing the writing abilities of students. A shame: the earlier test was not just more accurately predictive as an estimate, it was fairer, shorter, and cheaper. But ETS has made the right decision to value consequential validity above accuracy and elegance.

Via Alexander Russo comes word of a “misguided war” against standardized testing, and a backlash against the backlash. “Standardized testing is rarely fun — and it could almost certainly be improved — but it’s not nearly as antithetical to real, deep learning as its detractors suggest,” writes Anna North at the blog Jezebel, who scoffs at well-off parents refusing to let their kids sit for tests. Such protests

“…. run the risk of deepening the divide between haves and have-nots that continues to plague public education — and pretty much every other aspect of society. Any attempt to scuttle standardized testing needs to acknowledge that even if the tests are problematic, the deficits they attempt to address are real — and any alternative approach needs to face these deficits, not just walk away from them.”

North slightly misdiagnoses the issue. Personally, I have no problem with tests, per se. But you’d have to be naive to dismiss the impact preparing for those tests has had on the children North and everyone else purports to care so deeply about. Talk to someone who has taught in a low-performing school and you’ll almost certainly hear stories about prodigious amounts of time sacrificed on the altar of practice tests and language arts lessons in “test sophistication.” At my South Bronx elementary school, we had a Teachers College consultant who encouraged us to “teach tests as a genre of literature.” But even that pales in comparison to the experience of a grad student of mine, who was mandated to spend two hours per day on test prep from the first day of school.

Testing and accountability are unlikely to disappear. Boycott the test? Perhaps, but if I were a parent activist, I would march into the school office the first day of school with the following bargain: “I’m sure you agree the best test prep is great teaching and a robust curriculum, Ms. Principal. So let’s keep our focus right there. Don’t worry about spending my child’s time and your budget dollars on test prep materials. Because if they show up in our kids’ classrooms, we can promise our kids won’t be showing up for the test.”

Keep an eye on New York State education commissioner David Steiner, who is gearing up to implement a long overdue reform: establishing a link between test scores and college readiness.

Harvard’s Daniel Koretz, at Steiner’s urging, has been looking at the correlation between New York’s eighth-grade test scores and high school Regents exam scores. Notes the Buffalo News: “The conclusion: Students in New York State are moving through elementary, middle and high school with test scores they believe to be adequate, but once they get to college, they find they are not prepared.” That’s not a complete shock given the boxcar numbers of college freshmen who need remediation once they arrive on campus. But the New York Post’s Yoav Gonen points out what will surely be the most repeated fact from Koretz’s forthcoming study: eighth-graders who score a 3 out of 4 on state math and reading tests have just a 52 percent chance of graduating high school, even though they’ve been told they’re on track.

Let that rattle around inside your head for a moment: A child who is deemed proficient in eighth grade has a chance only slightly better than a coin toss of graduating high school just four years later. “We’ve been calling that ‘proficient,’ ” state Board of Regents Chancellor Merryl Tisch told The Post’s editorial board. “We were giving out misleading information.”

Gee, ya think?

The study is to be released Monday, but anyone who has taught in New York in the last several years can’t be surprised. For years, I saw 5th graders come into my Bronx classroom who were ostensibly on grade level yet demonstrated little command of basic arithmetic. That was evidence enough that all that glitters isn’t gold.

Steiner’s insistence that test scores should actually mean something is clearly going to rattle some cages and prompt a long, hard look at where school districts in New York have made real gains and where they haven’t. Buffalo’s school superintendent blasted Steiner and his deputy John King last week for focusing on more rigorous tests. “I think they’re two people who don’t know what they’re doing,” James A. Williams told the Buffalo News. “A more rigorous test is not going to improve student achievement. It’s not going to improve the graduation rate. I think it’s ridiculous.”

I don’t follow Williams’ complaint. By my read, Steiner isn’t talking about testing our way to proficiency; he’s talking about making test scores indicative of real-world proficiency. As I’ve argued in this space before, if we’re going to insist on viewing everything in education through the prism of test scores, those scores have to be meaningful. Steiner, King and Tisch deserve all the credit in the world for taking this on.

Here’s something I didn’t know: the same scanners that score standardized tests can be used to count the erasures in which answers are changed from wrong to right. Too many changes and a school or teacher can come under suspicion of cheating. That’s the case in Georgia, where nearly 200 schools are being investigated following a study by the Governor’s Office of Student Achievement, the New York Times reports.

The study determined the average number of wrong-to-right erasures statewide for each grade and subject, and flagged any classroom with an unusually high number. For example, in fourth-grade math, students on average changed 1.8 answers from wrong to right, while one classroom that was flagged as suspicious had more than 6 such changes per student. Four percent of schools were placed in the “severe concern” category, which meant that 25 percent or more of a school’s classes were flagged. Six percent were in the “moderate concern” category, which meant that 11 percent to 25 percent of a school’s classes were flagged, and 10 percent raised “minimal concern,” meaning 6 percent to 10 percent of a school’s classes were flagged.
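
For readers who like to see the mechanics, the categorization scheme described above can be sketched in a few lines of Python. This is a hypothetical illustration only: the concern thresholds and the 1.8-erasure statewide average come from the article, but the flagging rule here (a class is suspicious if its average wrong-to-right erasures exceed three times the statewide norm) is my own stand-in for whatever statistical test Georgia’s analysts actually used.

```python
# Hedged sketch of the erasure-analysis categories reported by the Times.
# The statewide average (1.8) and the concern thresholds are from the
# article; the 3x flagging factor is an assumption, not the real method.

STATEWIDE_AVG_ERASURES = 1.8  # fourth-grade math, wrong-to-right changes

def is_flagged(class_avg_erasures, statewide_avg=STATEWIDE_AVG_ERASURES,
               factor=3.0):
    """Flag a classroom whose average wrong-to-right erasure count
    greatly exceeds the statewide norm (the factor is an assumption)."""
    return class_avg_erasures > factor * statewide_avg

def school_concern_level(flagged_fraction):
    """Map the fraction of a school's flagged classes to the
    concern categories quoted in the report."""
    if flagged_fraction >= 0.25:
        return "severe concern"
    if flagged_fraction >= 0.11:
        return "moderate concern"
    if flagged_fraction >= 0.06:
        return "minimal concern"
    return "no concern"

# The suspicious classroom cited in the article averaged 6+ changes
# per student, more than triple the statewide average of 1.8.
print(is_flagged(6.0))               # True: well above the threshold
print(school_concern_level(0.80))    # the worst Atlanta schools
```

Even this toy version makes the scale of the Atlanta numbers vivid: a school where 80 percent of classes are flagged sits far beyond the “severe concern” cutoff of 25 percent.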

At 27 schools, 21 of which were in the Atlanta district, more than half the classes were flagged, and at four Atlanta schools more than 80 percent of the classes were flagged, the Times reports.

For a fascinating read, go to Schneier on Security, a blog on computer security issues, where commenters are picking apart the Georgia investigation. “What study has been done showing that the percentage of answers changed from wrong to right is a good indicator of cheating?” one asks.

I’m HIGHLY skeptical of the ability of a scanner to determine whether or not an answer was changed. If you look at the numbers in the report closely, you’ll see that according to the scanner almost all changes were wrong to right; there were very few wrong to wrong answers recorded. That alone strikes me as wildly improbable. One big flaw of this study is that there is no evidence that they took a random sample of the recorded changes and *visually inspected* those documents to determine if what the scanner was recording was in fact accurate. Don’t misunderstand. I am sure there are teachers who cheat. I’m just skeptical that this study is anything other than a witch hunt.

Other commenters suggest it would be child’s play to defeat scanning for erasures: simply fill in all the answers and erase the wrong ones.

…a smart teacher would also create erasures on wrong answers that they haven’t changed to defeat the wrong->right/right->wrong statistic. Could the analyst infer that the teacher was cheating just because of an increase in erasures where there is no discernible bias in the erasures themselves? This is quickly becoming a counter-intelligence exercise.

The best comment comes from someone outside education, and outside the U.S. “Can someone please explain this topic to us non-Americans? I don’t understand what this is about,” he writes. “Back when I was in school the students cheated, not teachers. Why would they do that? Makes no sense.”

In his New York Times column praising the Obama administration’s “quiet revolution” on education, David Brooks writes “there is clear evidence that good teachers produce consistently better student test scores.” I ask this question not rhetorically, but in earnest: what is the “clear evidence” to which Brooks refers? Is there a study that defines good teaching, identifies good teachers and THEN looks at the impact of those teachers on test scores?

If we define good teaching as the ability to raise test scores, Brooks’ assertion is merely a tautology.