For sure the ‘reformers’ have won a battle and have unfairly humiliated thousands of teachers who got inaccurate poor ratings. But I am optimistic that this will be looked at as one of the turning points in this fight. Up until now, independent researchers like me were unable to support all our claims about how crude a tool value-added metrics still are, though they have been around for nearly 20 years. But with the release of the data, I have been able to test many of my suspicions about value-added. Now I have definitive and indisputable proof which I plan to write about for at least my next five blog posts.

The tricky part about determining the accuracy of these value-added calculations is that there is nothing to compare them to. So a teacher gets an 80 out of 100 on her value-added. What does this mean? Does it mean that the teacher would rank 80 out of 100 on some metric that took into account everything that teacher did? As there is no way, at present, to do this, we can’t really determine if the 80 was the ‘right’ score. All we can say is that according to this formula, this teacher got an 80 out of 100. So what we need, to ‘check’ how good a measure these statistics are, is some ‘objective’ truths about teachers. I will describe three, and we will see whether the value-added measures support them.

On The New York Times website they chose to post a limited amount of data. They have the 2010 rating for the teacher and also the career rating for the teacher. These two pieces of data fail to demonstrate the year-to-year variability of these value-added ratings.

I analyzed the data to see if they would agree with three things I think every person would agree upon:

1) A teacher’s quality does not change by a huge amount in one year. Maybe they get better or maybe they get worse, but they don’t change by that much each year.

2) Teachers generally improve each year. As we tweak our lessons and learn from our mistakes, we improve. Perhaps we slow down when we are very close to retirement, but, in general, we should get better each year.

3) A teacher in her second year is way better than that teacher was in her first year. Anyone who has taught will admit that they managed to teach way more in their second year. Without expending so much time and energy on classroom management, and also by not having to make all lesson plans from scratch, second year teachers are significantly better than they were in their first year.

Maybe you disagree with my #2. You may even disagree with #1, but you would have to be crazy to disagree with my #3.

Though the Times only showed the data from the 2009-2010 school year, there were actually three files released, 2009-2010, 2008-2009, and 2007-2008. So what I did was ‘merge’ the 2010 and 2009 files. Of the 18,000 teachers in the 2009-2010 data I found that about 13,000 of them also had ratings from 2008-2009.
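The ‘merge’ step can be sketched as follows. This is only an illustration, assuming each file can be reduced to a mapping from a teacher identifier to a percentile score; the identifiers and scores below are invented, and the real files are laid out differently.

```python
# Hypothetical records: teacher id -> percentile score for that year.
# These names and values are placeholders, not the released data.
scores_2009 = {"T001": 35, "T002": 80, "T003": 52}
scores_2010 = {"T001": 60, "T002": 41, "T004": 77}


def merge_years(earlier, later):
    """Keep only teachers rated in both years.

    Returns {teacher_id: (earlier_score, later_score)}.
    """
    return {tid: (earlier[tid], later[tid]) for tid in later if tid in earlier}


merged = merge_years(scores_2009, scores_2010)
print(len(merged), "of", len(scores_2010), "teachers in the later file matched")
```

Teachers who appear in only one of the two files (T003 and T004 here) drop out, which is why the merged set is smaller than either file.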

Looking over the data, I found that half of the teachers had a ‘swing’ of at least 21 points one way or the other. There were even teachers who had gone up or down as much as 80 points. The average change was 25 points. I also noticed that 49% of the teachers got lower value-added in 2010 than they did in 2009, contrary to my experience that most teachers improve from year to year.
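For anyone who wants to reproduce these summary numbers, here is a minimal sketch of the calculation. The score pairs are made up for illustration, so the output will not match the figures above.

```python
from statistics import mean, median

# Made-up (2009 score, 2010 score) pairs, NOT the released data.
pairs = [(35, 60), (80, 41), (52, 55), (10, 90), (70, 64), (45, 45)]


def swing_summary(pairs):
    """Summarize year-to-year change in percentile score."""
    changes = [later - earlier for earlier, later in pairs]
    swings = [abs(c) for c in changes]  # size of change, either direction
    return {
        "median_swing": median(swings),
        "mean_swing": mean(swings),
        "pct_lower": 100 * sum(c < 0 for c in changes) / len(changes),
    }


summary = swing_summary(pairs)
print(summary)
```

The median swing is the number that half of the teachers meet or exceed; the mean is pulled upward by the few very large swings.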

I made a scatter plot with each of these 13,000 teachers’ 2008-2009 score on the x-axis and their 2009-2010 score on the y-axis. If the data were consistent, one would expect some kind of correlation with points clustered on an upward sloping line. Instead, I got:

With a correlation coefficient of .35 (and even that is inflated, for reasons I won’t get into right now), the scatter plot shows that teachers are not consistent from year to year, contrary to my #1, nor do a good number of them go up, contrary to my #2. (You might argue that 51% go up, which is technically ‘most,’ but I’d say you’d get about 50% with a random number generator — which is basically what this is.)
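As a reference for what the correlation coefficient measures, here is a small sketch of the standard Pearson calculation. The function name and the sample lists are illustrative only.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Perfectly consistent scores give r = 1; perfectly reversed give r = -1.
r_consistent = pearson_r([10, 20, 30, 40], [15, 25, 35, 45])
r_reversed = pearson_r([10, 20, 30], [30, 20, 10])
print(r_consistent, r_reversed)
```

Squaring r gives the share of variance in one year's scores that a straight-line fit on the other year's scores accounts for. For r = .35 that is .35 × .35, or about .12.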

But this may not sway you since you do think a teacher’s ability can change drastically in one year and also think that teachers get stale with age so you are not surprised that about half went down.

Then I ran the data again. This time, though, I used only the 707 teachers who were first year teachers in 2008-2009 and who stayed for a second year in 2009-2010. Just looking at the numbers, I saw that they were similar to the numbers for the whole group. The median amount of change (one way or the other) was still 21 points. The average change was still 25 points. But the amazing thing, which definitely proves how inaccurate these measures are, is that the percent of first year teachers who ‘improved’ on this metric in their second year was just 52%, contrary to what every teacher in the world knows: that nearly every second year teacher is better than she was in her first year. The scatter plot for teachers who were new teachers in 2008-2009 has the same characteristics as the scatter plot for all 13,000 teachers. Just like the graph above, the x-axis is the value-added score for the first year teacher in 2008-2009 while the y-axis is the value-added score for the same teacher in her second year during 2009-2010.

43 Responses

GR, I’m with you. Publishing the scores was a puff-your-chest move. Rushing to include VAM in formal evaluations will turn the tide against a potentially promising tool. Demoralizing teachers is popular for reasons no rational person can understand. Economists and reformers can apply high levels of abstraction and little nuance to what is a complex profession.

But you can’t put this type of research standard on VAM and then completely ignore it for current measures of quality: experience and master’s degrees.

Take your second assumption: “Teachers generally improve each year.” For the first five years, there’s suggestive evidence. From years 6-10, a bit less. From years 10-30, close to none.

Dan Goldhaber found nearly identical distributions of teacher quality comparing two groups: those with and those without master’s degrees.

VAM, by comparison, is considerably more reliable. As a teacher for 15 years or whatever it is, you surely know that there’s (significant) variability in the quality of teachers. I think a better path is to let VAM breathe for a few years, let the modeling improve some, and then we’ll see.

I ask you: in the tradeoff between type 1 (dismissing an effective teacher) and type 2 (keeping an ineffective one), which do you choose? The current system runs rampant with type 2. VAM obviously has serious potential for type 1 (and type 2).

Sean, what’s this with letting VAM breathe for a few years until the models improve?

It’s not like they’re piloting this system. Starting in 2012-2013, teachers all across New York State will have 40% of their evaluations come from VAM – 20% from state tests, 20% from local tests, third party tests or the state tests measured a different way than the state measured them.

If VAM is as unreliable as what Gary shows above, we’re going to see thousands of teachers unfairly tarred with the “ineffective” label who wind up in the NY Post with a glossy DOE-provided photo under the headline NY’S WORST TEACHERS.

Maybe if I thought the Regents and the NYSED and Cuomo and Bloomberg and Gates and Murdoch and the rest of the so-called reformers weren’t trying to rid the system of thousands of teachers, I might trust them to implement this system fairly.

But since I know that’s exactly what they want to do, I do not trust them or the system they want to implement.

Given that Bloomberg is on record about wanting to fire 50% of NYC teachers, Gates thinks most teachers suck, and Merryl Tisch believes teachers are THE problem in public education (funny, she’s been a Regent for 15 years, but somehow the problems are never her fault), I think I would be a fool to trust them to fairly implement so complex and easy to manipulate a system.

Therein lies the problem with VAM for me. I do not trust the people implementing it and it is so complex and non-transparent as it now stands that I would be a fool if I did.

Perhaps as you say the model will improve later on.

When that happens, we can then argue the wisdom of basing evaluations on high stakes tests.

Until then, what we have is a poisoned and toxic environment that suggests teachers be wary of any “reforms” the powers that be want to implement, especially ones as complex as VAM. The publication of the TDR’s after the DOE offered promises that they would never do that is an exclamation point on the need for wariness.

An important note that I don’t believe has been mentioned by publishing organizations (and the reason you should not expect a large jump from first to second year): Teachers with one and two years of experience are graded separately, with their percentile rankings representing their performance within the “peer group”. For first year teachers, the peer group consists only of other first year teachers, and likewise for second year teachers; as a result, the expected net improvement in percentile rankings from the first to second year would be close to zero.

Of course this means that it’s highly inappropriate to compare the percentile rankings of first/second year teachers to those of more experienced teachers; it only makes sense to compare within the same level of experience (>2 years was considered as all the same level of experience for peer grouping purposes). No online databases that I have seen have noted this effect.

Disclosure: I am one of those teachers in your sample group who saw large improvements in value-added scores from the first to second year. I appear to benefit from comparisons to other teachers in my school (for 09-10), when we were in fact not in the same comparison group.
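The peer-group point in this comment can be illustrated with a toy example. This is a sketch under stated assumptions: the raw "effectiveness" numbers are invented, and the ranking function is a simple stand-in, not the formula the DOE actually used.

```python
def percentile_ranks(raw):
    """Percentile rank (0-100) of each teacher within her own cohort.

    Assumes no ties in the raw scores (ties would be ordered arbitrarily).
    """
    order = sorted(raw, key=raw.get)
    n = len(order)
    return {tid: 100 * (i + 0.5) / n for i, tid in enumerate(order)}


# Invented raw effectiveness for five first-year teachers.
year1 = {"A": 10, "B": 20, "C": 30, "D": 40, "E": 50}
# Every one of them improves in year two, by different amounts.
gains = {"A": 25, "B": 5, "C": 15, "D": 30, "E": 8}
year2 = {tid: year1[tid] + gains[tid] for tid in year1}

rank1 = percentile_ranks(year1)  # ranked against other first-years
rank2 = percentile_ranks(year2)  # ranked against other second-years
changes = {tid: rank2[tid] - rank1[tid] for tid in year1}

mean_change = sum(changes.values()) / len(changes)
went_up = sum(c > 0 for c in changes.values())
print(changes, mean_change, went_up)
```

All five teachers got better in raw terms, yet only two of the five moved up in percentile and the average change is zero, because the ranks within a cohort always average out to the same value.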

I want to thank both of you for your work. I am a value-added proponent (with 3 kids in public school), but I also think that we do need to subject these scores to rigorous study. After arguing various aspects of value-added methods with many uninformed people, it’s nice to see at least 2 people who are thoughtful about the data.

Great work, Gary. I hope others will take the same approach and expose the problems that are apparent here. And frankly, even if the models improve and there’s stronger correlations, I wouldn’t accept those correlations as proof of overall teaching efficacy. There are still too many assumptions built into the models, and too little of our work in the classroom and school accounted for in the tests.

Gary, I love your analytical approach to these questions, and your conclusions are articulate, admirably spirited, fun to read, and most importantly, firmly anchored in common sense and reality. I do have one question for you. I am not a teacher, or in any way involved in school policy, so I am by no means an expert on the trends you cited as a standard by which to evaluate the validity of these measurements of teacher performance. However, I must say, as a complete outsider, it seems to me that your assumptions #1 and #3 are contradictory. If #1 is true, and “a teacher’s quality does not change by a huge amount in one year,” then how can it be true, as you stated in #3, that “second-year teachers are significantly better than they were in their first years”? I suppose it depends on what you mean by ‘a huge amount’ and ‘significantly better,’ but if #3 is an important exception to #1, are there perhaps other exceptions?

Good question. Generally teachers don’t get that much better or that much worse in one year, but the exception is from the first year to the second year. Most first year teachers are very ineffective and they have a huge jump in improvement from the first to the second year. The reason for this is that by the time they have figured out what to do during their first year, it is too late since the students have pegged that teacher as not worth listening to.

Would like to read your other columns. How do I get them? Your server is turned off. I would only temper your part I results with the small sample sizes for many of those tested. Absolute relevance based on 2 years of data is not practical here; it would take several years of comparative analysis. More critically, if the alternative is lumping in everyone together, then I say a value-added approach over a five year period would show who the top teachers are who should have first shot to run our schools and who the poorest performers are who deserve training. That kind of thinking doesn’t currently exist in a one-size-fits-all union/political model. I wish all of our kids had Stuyvesant-like teachers. They can’t and this hurts our kids. Data will help indicate who the best and most effective teachers are.

Why did the UFT agree to Klein’s fantasy of using test scores to evaluate teachers in the first place- even with the promise of confidentiality? Was Randi that eager to show she “plays well with others” that she would agree to such a faulty premise? I mean, why go there if you disagree with the fundamental premise?

With the release of the data reports it has been nothing but damage control. The UFT tries to spin it as taking the moral high ground shaming the DoE for the release, but it’s more accurately damage control on their part for helping to create this situation.

Many of us have opposed mayoral control, the 2005 contract, the agreements creating the ATR debacle only to be ignored by Unity Caucus “play well with others” UFT officials. Now the UFT continues its endless campaign of dismantling the union and morphing into an ombudsman department of the DoE. No serious confrontation. No alternative vision of what education in the city could be like. Just one attempt to mitigate the damage and embarrassment to the union and culpability of the Unity political machine taking the course of least resistance with the only bottom line being the flow of union dues.

Yeah, I agree. The unions are supposed to be the one supporting teachers even when it ‘puts the needs of adults before the needs of children’ yet it seems that they are often selling out the teachers. I don’t know if it is because they are intentionally not doing their jobs or if they are just not good at it.

They don’t think the members of the UFT will ever strike. So, they have to cut these deals. The DoE pushes the UFT 5 steps back, the UFT battles back 3 but hails the gain of the three as a great victory, even though there’s a net loss of 2 steps. It’s jujitsu, just minimize the damage and call it a victory.

There’s a lot of good people in the UFT HQ and Unity. But, it’s still a political machine and undemocratic. For example, District Rep positions used to be elected by the chapter leaders they serve. Now they are all appointed. The UFT is itself a corporate-style entity. If they stood for something they would have endorsed Thompson for mayor but instead didn’t endorse thinking they had a deal with Bloomberg, who played the UFT for chumps. Their political expediency failed. But, they will never admit it was a mistake.

That’s true that IF there were a lot of overlapping points, the scatterplot might be misleading. But this also would be captured by the correlation coefficient. When there are not a lot of overlapping points, the density is just the apparent darkness by having points close together. Your hex density graphs have a way of making a .3 correlation look better than it is, depending on the size of your hexagons. I will do some checks to see if I have a lot of overlapping.
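A check for overlapping points could look something like the sketch below. It assumes the scores are whole-number percentiles stored as (x, y) pairs; the sample pairs are invented.

```python
from collections import Counter

# Invented (2009 score, 2010 score) pairs with deliberate duplicates.
pairs = [(35, 60), (35, 60), (80, 41), (52, 55), (35, 60), (80, 41)]


def overlap_report(pairs):
    """Count how many plotted points land exactly on top of another point."""
    counts = Counter(pairs)
    return {
        "total_points": len(pairs),
        "distinct_locations": len(counts),
        "points_sharing_a_location": sum(c for c in counts.values() if c > 1),
        "largest_stack": max(counts.values()),
    }


report = overlap_report(pairs)
print(report)
```

If most points share a location with others, a plain scatter plot hides where the mass of the data sits, and a density plot or some jitter would be more honest. If few do, the plain scatter plot is showing roughly what is there.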

Gary, this is untrue. If you honestly believed what you’re saying, then you would have turned on some level of scatter, to blur the quantized effect. Regardless, that is just a band-aid solution, and EB’s original comment is still preferred, from an analyst’s perspective.

Others have already commented upon this, however since it’s an issue that likely needs strength in numbers, I reiterate that you simply used the wrong sort of graphical analysis.

I am a statistician by trade, and fully understand the frustration embedded in the quote “there are lies, damned lies, and statistics”. Either ignorance or a willful intention to mislead directed your choice in plots.

A density plot, a correlation value, or some other accurate measure of spread should have been used. Please either do not use statistics in your arguments, or use them accurately. Do not give folks outside of the trade the satisfaction of shouting the mantra of “lies, damned lies, and statistics”.

I do appreciate your comments. The correlation value was, I think, r^2=.3. The data is all publicly available. If you want to re-analyze to show that value-added is very reliable and stable, feel free to do it. Certainly you’d have to admit that a teacher teaching 7th and 8th grade in the same year should get relatively similar value-added scores if these scores are to be used for salary increases and layoffs, no?

First off, thanks for the reply! I think my initial comment was harsher than I’d meant, looking back upon it.

To your follow up point, any explanatory factor that can be captured, which would bump up the correlation, should be captured. In other words, if there is some factor differentiating the data sets, and you expect that factor to have little impact, then the R^2 SHOULD be approximately the same whether you include the factor or not. If the correlations change, then that factor is acting upon the data in some fashion.

In this case, if you compare data with 7th & 8th graders together, and each of them separately, and find the correlation to be stronger — with statistical confidence — when the classes are analyzed separately, then there is something going on to make that 7th & 8th grade difference significant. Regardless, if you can improve the R^2 by separating data through explanatory parameters (not just cherry-picking the removal or inclusion of outliers), then it should be separated.

Hypothesis testing fundamentally runs on this concept: go on the assumption that there is no difference, then try to disprove yourself. In this case, assume there is no difference between 7th & 8th, then see if the R^2's are significantly different, to see if your assumption is wrong. If it is, and there is a difference, then your approach & article should take the highest correlation. Stats / Science / Data science all must be run conservatively: study & analyze the data to show the biggest chink in the armor of your claim. If that chink is still insignificant, then you have indeed proven your claim.
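One common way to carry out the kind of comparison described here is Fisher's z-transformation, which tests whether two correlation coefficients from independent samples differ. This is offered as a sketch of that standard technique, not as the method anyone in this thread used. Note that it works on r rather than R^2, and the r values and sample sizes below are invented.

```python
from math import atanh, erf, sqrt


def compare_correlations(r1, n1, r2, n2):
    """Two-sided test of H0: the two population correlations are equal.

    Assumes the two samples are independent. Returns (z statistic, p-value).
    """
    z1, z2 = atanh(r1), atanh(r2)  # Fisher z-transform of each r
    se = sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    z = (z1 - z2) / se
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    return z, 2 * (1 - normal_cdf)


# Invented example: r = .50 in one group vs r = .20 in another.
z_diff, p_diff = compare_correlations(0.50, 1000, 0.20, 1000)
# Identical correlations: no evidence of a difference.
z_same, p_same = compare_correlations(0.35, 500, 0.35, 500)
print(z_diff, p_diff, z_same, p_same)
```

A small p-value is evidence against the starting assumption of no difference between the groups; a large one means the data give no reason to analyze them separately.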