How reformers (unfortunately) get captivated by experimental technology

Because teacher evaluation is such an important part of school reform and at the center of the Chicago teachers strike, I have published several pieces on the issue in the last week. Below is a new historical look at the subject. It was written by Jack Schneider and Ethan Hutt.

Schneider is an assistant professor of education at the College of the Holy Cross and author of “Excellence For All: How a New Breed of Reformers Is Transforming America’s Public Schools.” Hutt is a doctoral candidate at the Stanford University School of Education and has been named “one of the world’s best emerging social entrepreneurs” by the Echoing Green Foundation. A version of this appeared on Larry Cuban’s blog on School Reform and Classroom Practice. Cuban is a former high school social studies teacher (14 years, including seven in Washington D.C.), district superintendent (seven years in Arlington, Virginia) and professor emeritus of education at Stanford University, where he has taught for more than 20 years.

By Jack Schneider and Ethan Hutt

Public school district leaders in Chicago, following the lead of reformers in cities nationwide, are pushing for a “value-added” evaluation system. Unlike traditional forms of evaluation, which rely primarily on classroom observations, policymakers in Chicago propose to quantify teacher quality through the analysis of student achievement data. Using cutting-edge statistical methodologies to analyze standardized test scores, the district would determine the value “added” by each teacher and use that information as a basis for making personnel decisions.

Teachers are opposed to this approach for a number of reasons. But educational researchers are generally opposed to it, too, and their reasoning is far less varied: value-added evaluation is unreliable.

As researchers have shown, value-added methodologies are still very much works-in-progress. Scholars like Heather Hill have found that teachers’ value-added scores correlate not only with quality of instruction, but also with the population of students they teach. Researchers examining schools in Palm Beach, Fla., discovered that more than 40 percent of teachers scoring in the bottom decile one year, according to value-added measurements, somehow scored in the top two deciles the following year. And according to a recent Mathematica study, the error rate for comparing teacher performance was 35 percent. Such figures could only inspire confidence among those working to suspend disbelief.

And yet suspending disbelief is exactly what reformers are doing. Instead of slowing down the push for value-added, they’re plowing full steam ahead. Why?

The promise of a mechanized quality-control process, it turns out, has long captivated education reformers. And while the statistical algorithm in question right now in Chicago happens to be quite new, reformer obsession with ostensibly standardized, objective, and efficient means of gauging value is, in fact, quite old. Unfortunately, as the past reveals, plunging headlong into a cutting-edge measurement technology is also quite problematic.

Example 1:

Nearly a century ago, school leaders saw a breakthrough in measurement technology as a way of measuring teacher quality. By using newly designed IQ tests to assess “native ability,” school administrators could translate student scores on standardized tests into measures of teacher effectiveness. Of course, not everyone was on board with this effort. As one school superintendent noted, some educators were concerned “that the means for making quantitative and qualitative measures of the school product” were “too limited to provide an adequate basis for judgment.” But the promise of the future was too tempting and, as he argued, though it was “impossible” to measure teacher quality rigorously, “a good beginning” had been made. Reformers plowed ahead.

The IQ movement was deeply flawed. The instruments were faulty and culturally biased. The methodology was inconsistent and poorly applied. And the interpretations were horrifying. “If both parents are feeble-minded all the children will be feeble-minded,” wrote H.H. Goddard in 1914. “Such matings,” he reasoned, “should not be allowed.” Others drew equally shocking conclusions. E.G. Boring, a distinguished psychologist of the period, wrote in 1920 that “the average man of many nations is a moron.” The average Italian-American adult, he calculated, had a mental age of 11.01 years. African-Americans were at the bottom of his list, with an average mental age of 10.41.

Value-added proponents like to make the argument that “some data is better than no data.” Yet in the case of the mental testing movement, that was patently false. For the hundreds of thousands of students tracked into dead-end curricula, to say nothing of the forced sterilization campaigns that took place outside of schools, reform was imprudent and irresponsible.

But one need not go back so far into the educational past for examples of half-baked quality-control reforms peddled by zealous policymakers.

Example 2:

In the 1970s, 37 states hurriedly adopted “minimum competency testing” legislation and implemented “exit examinations,” ignoring the concerns of experts in the field. As one panel of scholars observed, the plan was “basically unworkable” and exceeded “the present measurement arts of the teaching profession.” Reformers, however, were not easily dissuaded.

The result of the minimum competency movement was the development of a high stakes accountability regime and years of litigation. Reformers claimed that the information revealed by such tests would provide the sunlight and shame that schools needed to improve. Yet while they awaited that outcome, thousands of students suffered the indignity of being labeled “functionally illiterate,” were forced into remedial classes, and had their diplomas withheld despite having enough units to graduate—all on the basis of a test that leading scholars described as “an indefensible technology.”

Contrary to what reformers claimed, the information provided by such deeply flawed tests did little to improve students’ learning or the quality of schools.

Today’s policymakers, like those of the past, want to adopt new tools as swiftly as possible. Even flawed value-added measures, they argue, are better than nothing. Yet the risks of early adoption, as the past reveals, can far outweigh the rewards. Simply put, acting rashly on incomplete information makes mistakes more costly than necessary.

Today’s value-added boosters believe themselves to be somehow different — acting on better incomplete information. Yet the idea that incomplete information can be good strains credulity.

Good technologies do tend to improve over time. And if advocates of value-added models are confident that they can work out the kinks, they should continue to experiment with them judiciously. In the meantime, however, such models should be kept out of all high-stakes personnel decisions. Until we can make them work well enough, they shouldn’t count.
