Value-added ratings are one important piece of a complete personnel system. But student test scores alone aren’t a sensitive enough measure to gauge effective teaching, nor are they diagnostic enough to identify areas of improvement. Teaching is multifaceted, complex work. A reliable evaluation system must incorporate other measures of effectiveness, like students’ feedback about their teachers and classroom observations by highly trained peer evaluators and principals.

True enough in terms of building a comprehensive evaluation system, and one that works across all grades and subjects, not just the tested ones. But value-add is a better predictor of persistent high and low-performance than Gates seems to be acknowledging here. The Gates Foundation’s own work shows that value-add fares well compared to other methods. The political sensitivities around this entire issue and the desire to create buy-in are obvious, but we’re long overdue for a more straightforward conversation here as well.

Second, Gates doesn’t get into the more complicated issue of parental access to this information. I do think parents should be able to know if their child is being taught by a teacher with persistently low evaluations, and should have access to the results of evaluations – the entire summative evaluation, not just the value-add score in grades and/or subjects that are assessed with standardized tests. It’s an issue of equity, because right now there is an informal information flow that allows some parents to make better choices about teachers than others. Surfacing information in a responsible way can help level the field and introduce another incentive for more robust human resources policies than are the norm today.

14 thoughts on “Bill Gates On Teacher Evaluations: You Publish, We Perish!”

I just read Mr. Gates’ editorial. Thank goodness for his common-sense admission that there is no easy or cheap way to evaluate a teacher. And of course public humiliation will backfire badly when teachers try desperately to find jobs in “good” districts. Teachers and students need Mr. Gates’ help in educating the public.

Has anyone noticed that the outcry against value-added is coming mostly from teachers in low-income schools? Here’s why:

Any group test must be based on the content for the grade being tested. So a standardized test for sixth grade will have test items that mainly cover a sixth grade curriculum with some items below grade level and some items above.

Now if a sixth grade teacher teaches in a very low-income school there is a good chance that the majority of her students are two to four years below grade level. She could even have a few students who cannot read at all. So, even if she does an excellent job of teaching her students, it is likely that there will be very few items on the test that measure the child’s progress. If the student went from a first grade level in reading to a third grade level (good progress) this improvement would not be measured by the group test, making it appear that the child has learned nothing. The only way to measure the progress of a student so far behind would be through an individually administered wide-range test (pre and post) or (better yet) through individual comprehensive assessment by a professional.

I taught English language learners for many years and enjoyed every minute of it. However, if I had been publicly humiliated for their low standardized test scores, I would have asked for a transfer to the affluent part of town, and that’s what I predict other teachers will do if this “value-added” nonsense continues.

Value-added models fail to control for many variables outside the teacher’s control. As such, they should not drive high-stakes personnel decisions. At most, value-added models can serve as a screen to identify teachers for closer management or peer-evaluation review.

A value-added model uses students’ test scores in earlier years to predict what those students’ test scores this year will be. The model then compares Teacher A’s students’ actual test scores for this year to the model’s predictions for those same students. The model therefore largely controls for individual students’ academic ability.
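The core logic described above can be sketched in a few lines of code. This is a deliberately minimal illustration, not any district’s actual model: the regression, the invented teacher names, and the invented scores are all assumptions for demonstration, and real value-added models add many more controls, multiple prior years, and statistical shrinkage.

```python
# Minimal value-added sketch: predict this year's score from last year's
# score by ordinary least squares, then average each teacher's residuals
# (actual minus predicted). All data here is invented for illustration.

def value_added(records):
    """records: list of (teacher, prior_score, current_score) tuples."""
    n = len(records)
    priors = [r[1] for r in records]
    currents = [r[2] for r in records]
    mean_x = sum(priors) / n
    mean_y = sum(currents) / n
    # Fit current = a + b * prior by least squares.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(priors, currents))
    var = sum((x - mean_x) ** 2 for x in priors)
    b = cov / var
    a = mean_y - b * mean_x
    # Collect each teacher's residuals and average them.
    resid = {}
    for teacher, x, y in records:
        resid.setdefault(teacher, []).append(y - (a + b * x))
    return {t: sum(rs) / len(rs) for t, rs in resid.items()}

# Hypothetical example: two teachers, same prior-score distribution.
students = [
    ("Teacher A", 60, 72), ("Teacher A", 70, 80), ("Teacher A", 80, 88),
    ("Teacher B", 60, 66), ("Teacher B", 70, 74), ("Teacher B", 80, 84),
]
scores = value_added(students)
# Teachers whose students beat their predicted scores get a positive
# value-added estimate; those whose students fall short get a negative one.
```

Even this toy version makes the limitation concrete: the only control is the prior score, so anything else that systematically moves a class’s results ends up in the teacher’s residual.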

However, the model does not control for other important variables:

1. The impact (negative or positive) on effective instruction of having more/fewer “problem” students in the class — e.g., disruptive students, emotionally-disturbed students, ESL students, learning-disabled students, high-functioning autistic students. Adding one problem student to a class will make the class harder to teach; adding 10 problem students to a class will make the class almost impossible to teach. As the number of problem students increases, the time/energy the teacher must devote to non-academic duties increases, almost geometrically. These problem students will adversely impact Teacher A’s student test scores well in excess of the impact of these problem students’ own test scores — indeed, some of the problem students (e.g., disruptive or high-functioning-autistic students) might have higher than average test scores but still make effective instruction very difficult for the rest of the class.

2. The impact of widely-differing levels of academic/reading ability within the class — that is, although two classes might have the same value-added model predicted scores, it’s much easier to teach the class where all the students have average academic/reading levels than the class where half the class are academic superstars and half the class are academic failures.

3. The impact of student load — individual class size and total number of students.

4. The impact of how many different preparations the teacher is assigned — i.e., it’s much easier to teach 5 biology classes than 2 biology classes, 2 chemistry classes, and 1 physics class.

5. The impact of teaching experience, including extent of experience teaching the particular grade or subject.

6. The impact of administrative support — i.e., to what extent there is help in the classroom (classroom aides, personal aides for disabled students, parent volunteers) and to what extent the principal/front office supports/undermines Teacher A’s efforts to establish/enforce behavior/academic standards.

7. The impact of how much parental involvement there is and how easy it is for the teacher to obtain parental involvement. Student X and Student Y may have comparable YR 1 test scores, but if X’s parents have more time/skills/concern to help their child with his/her school work than Y’s parents, comparable teacher effort will result in higher YR 2 scores for X than for Y. This factor is particularly important when comparing value-added scores for teachers in affluent suburban schools to teachers in inner-city schools. For many inner-city students/teachers, productive parental involvement simply does not exist — there is only one parent, she works two jobs, she does not speak English, it’s virtually impossible for the teacher to communicate with her, and she barely finished high school herself.

High-stakes-testing is not the answer to the problem of how to identify/remove poorly-performing teachers.

An obvious answer is peer-evaluation — Montgomery County (MD) has operated an extremely effective peer-evaluation program for over ten years, discharging over 200 teachers and triggering the resignation of over 300 additional teachers who were referred for peer-evaluation. The teachers union supports the program. Few discharges have been challenged in litigation. It doesn’t cost much. And it doesn’t have the huge adverse side effects of high-stakes testing (e.g., discouraging teacher cooperation, encouraging teaching to the test, narrowing the curriculum, encouraging cheating, discouraging teachers from accepting assignment to low-income schools or “problem” students).

Wouldn’t allowing parents to request specific teachers make estimates of value-added even more biased? Although students are not currently randomly assigned to teachers, allowing parents to request teachers would exacerbate this problem. A supposedly high value-added score would be more the result of a self-selecting group of more informed and motivated parents and students. Conversely, a lower value-added estimate could be a direct result of having less motivated and informed students and parents. You could try to control for these factors, but it is quite difficult to measure how engaged or informed students and parents are.

I largely agree with Labor Lawyer’s points, though some of the models do attempt to control for some of the variables s/he mentions, such as peer effects and classroom composition, using rudimentary proxies.

Andy, I think you’re being sloppy in saying that: “value-add is a better predictor of persistent high and low-performance than Gates seems to be acknowledging here.” Actually, value-add is a better predictor of FUTURE VALUE ADD than some other measures. Value add is just one measure of “teacher performance,” and too many pundits and policy makers fall into this sloppy equivalence, confusing the measure/proxy with the actual construct it is meant to measure. There are many other important parts of a teacher’s job description that are not captured in value add — e.g., how well they develop students’ social skills and creative/divergent thinking, how well they collaborate and coordinate with other teachers, their leadership in the building, all sorts of things.

In some ways it’s not surprising that value add is a good predictor of future value add — they measure the same aspect of teaching performance. In fact, it’s quite surprising how unstable value add is across time and across sections of the same course taught by the same teacher.

VAM is best used for large scale research — e.g., comparing teacher prep programs — and should only be used very cautiously for personnel decisions. If it has to be used for personnel decisions, it is best used as a screening device in the way Labor Lawyer suggests. As a measure it is most reliable in identifying teachers at the extremes of the distribution — very high “performers” in terms of raising test scores, and very low “performers.”

What’s worrisome is how a broader policy can skew teacher behavior and provide perverse incentives in real life. High-stakes use of VAM will likely lead to even more teaching to the test, teacher exit from urban schools and challenging teaching placements, etc. And the fact is, in most states and districts we don’t have the data needed to use VAM in a responsible way. The dirty little secret is how bad the data systems are — limits in the number of grades and subjects tested, inaccurate links between teachers and student rosters, poor data controls and auditing functions.

Even if your goal is to eliminate a portion of the lowest performing teachers–a worthy goal–be wary of unintended system-wide effects and perverse incentives. NCLB did tons to really focus attention on underserved populations, but the design of the legislation has had lots of perverse consequences. Maybe in the end, some think it’s a net plus. But it’s always important to assess unintended consequences.

I would say the same thing about rhetoric re teachers. I know many so-called reformers don’t intend to attack or disparage teachers and don’t believe they are doing so. But intentions don’t really matter. Talk to most teachers, and they feel they are under attack, and many are demoralized. It’s the actual consequences that matter, not the intentions. If you are truly outcomes oriented, you should know this.

I would add that even in the area of producing student learning, VAM is a proxy for a much broader construct (increases in student knowledge, skills, and understanding). So, even if you define a teacher’s job solely in terms of increasing student learning, it’s important to distinguish between the measure and the larger construct for which it is a proxy.

This is the highest quality series of comments on an educational blog that I’ve ever seen (and I read a lot of these, these days). Congratulations to all of the commentators above for their keen insights and general civility, something long lost on most blogs.

I second the majority of the points made by LaborLawyer, Linda/Retired Teacher and Catlike, above. As all the commenters here pointed out, there are MANY problems with using VAM scores for high-stakes decisions about teachers’ performance. Maybe someone higher up in the education policy world will take note of these points and STOP the madness! (Bill Gates’ NY Times piece is a good start.)

And I would like to ask why the NYT and certain other media outlets wanted to get the teacher reports in the first place. This is a big story. Huge. Why aren’t the teachers going on strike or at least protesting in the streets? I mean, I don’t usually wear a tin foil hat, but there is something rotten here…

The problem of where to draw the line in deciding to include in these models things that teachers can control and what they can’t control has no easy answer, as Doug Harris makes clear in his book on VAM in education.

Going too far in one direction runs the risk of expecting lower achievement for poor children and minority children and of reducing incentives for schools to look for solutions. Going too far in the other direction leaves teachers throwing up their hands in despair. We’re going to have to work to get it right, and it’s going to take working together. No useful public purpose is served by propaganda one way or the other.
