Value Added — Scrutinizing The Most Widely Cited Study

by Gary Rubinstein

It was the best of teachers. It was the worst of teachers.

But how much better are the best teachers than the worst teachers? Well THAT’S a tough one, but one that is very important to answer. The corporate reformers believe that the gap is great so a feasible solution is to fire those bad teachers.

Now, nobody who ever had more than two teachers in their lives would argue that there is no difference in teacher quality. As a student I’ve had great teachers and lousy ones. As a teacher I’ve been a great one and a lousy one at different parts of my career. But we do need to get an accurate measure for how many and how bad the ‘bad’ teachers are before we go forward with a plan that assumes the worst.

Of course the reformers don’t just assume the worst, they say that ‘research shows’ that they are right. One of the most quoted pieces of research by reformers like Michelle Rhee and Wendy Kopp is that the difference between a student who has an effective teacher three years in a row vs. a similar student who has three ineffective teachers in a row is ‘life-changing.’

Watch thirty seconds of this video, from 20:00 to 20:30, to see an example of this statistic being described.

The challenge in proving or disproving this is that we can’t go to a parallel universe and compare what would have been had this student had this combination of teachers rather than that combination.

So we settle for educational research. Like psychology, education research is sometimes called a ‘soft’ science. Unlike something like chemistry or physics, the experiments that prove different theories are generally impossible to replicate. Also, there are many ways to use statistics to analyze the results of the various experiments and, like with most science, the interpretation can often be affected by what the researcher is hoping is true.

With this in mind, I decided to take on a challenge I’ve wanted to tackle for a while: find the study Rhee is referring to, study it, and interpret it for myself. I looked at presentations from The New Teacher Project, which Rhee founded, for clues. I learned that the two main value added studies are from 1996 and 1997. The 1996 Cumulative and Residual Effects of Teachers on Future study was conducted by Sanders and Rivers from Tennessee. The 1997 study ‘Teacher Effects on Longitudinal Student Achievement’ was conducted by Jordan, Mendro, and Weerasinghe from Dallas. Both studies concluded that the difference between having three effective teachers vs. three ineffective teachers can result in significantly higher standardized test scores after three years. They don’t make the leap to say that these standardized test score improvements are ‘life changing.’

I downloaded both studies. The Sanders study was very short so there was not a lot of raw data for me to play with to see if I could discover some of my own conclusions. But the Jordan study was 50 pages long with about 30 pages of raw data. The Jordan study is also the source of the most common example cited by reformers.

A study conducted by Jordan et al. (1997) estimated that average reading scores of sixth graders in Dallas schools would be expected to increase from the 59th percentile to the 76th percentile if they were assigned to three highly effective teachers in a row, while average scores for sixth graders would be expected to decrease from the 60th to the 42nd percentile if they were assigned to a series of three highly ineffective teachers during the same period. In mathematics, third graders in Dallas schools would be expected to increase their average mathematics score from the 55th percentile to the 76th percentile if they were assigned to three highly effective teachers, while the average mathematics score for third graders would be expected to decline from the 57th percentile to the 27th percentile if they were assigned to highly ineffective teachers for three years in a row.

Often a reproduction of one of the 20 graphs from the report is displayed in presentations comparing the scores of three groups of students who had similar scores in 1st grade and then after having three ineffective teachers, three medium teachers, or three effective teachers.

To get a context for interpreting the consequences of this chart, I read the Jordan report. Here’s what I understand about their methodology:

First they get a group of 3,200 kids about to start 2nd grade. They will track the reading scores for this group for three years, until they complete 4th grade. This group is called R4. They then get another 3,200 2nd graders to track math, and call them M4. Then they do the same with 3rd graders, 4th graders, 5th graders, and 6th graders to get 8 more 3,200 student cohorts, R5, R6, R7, R8, M5, M6, M7, M8.

They want to see the effects of having ineffective teachers vs. having effective teachers so they take the teachers that each of the 32,000 kids have each year and give them an effectiveness ranking, either 1 for bottom 20%, 2 for 20% to 40%, 3 for 40% to 60%, 4 for 60% to 80%, and 5 for 80% to 100%.

Then after three years, they sort the kids into groups depending on what quality teachers they had. If a student in R4 had a 3 teacher, a 5 teacher, and then a 2 teacher, they are in group 352. There are 125 groups for each 3,200 student cohort ranging from 111 to 555. If students were randomly assigned teachers, there should be at least 20 kids that had each combination of teacher.
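To make the grouping concrete, here is a minimal sketch of how the 125 group labels come about and how many students each group would hold if a cohort were spread perfectly evenly. The names `group_label`, `RATINGS`, and `COHORT_SIZE` are my own for illustration; this is not the code or procedure the Jordan report used.

```python
from itertools import product

RATINGS = (1, 2, 3, 4, 5)   # quintile ranks: 1 = bottom 20%, 5 = top 20%
COHORT_SIZE = 3200          # students per cohort, as described above


def group_label(ratings):
    """Join three yearly teacher ratings into a label such as '352'."""
    return "".join(str(r) for r in ratings)


# Every ordered sequence of three ratings: 111, 112, ..., 555.
all_groups = [group_label(combo) for combo in product(RATINGS, repeat=3)]

# Group size if students were spread perfectly evenly (an idealization).
expected_per_group = COHORT_SIZE / len(all_groups)

print(len(all_groups))          # 125
print(group_label((3, 5, 2)))   # 352
print(expected_per_group)       # 25.6
```

An even spread is only an average. Real assignment, random or not, would leave some groups larger and some smaller than this.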

Finally, they compare the change in standardized test scores for the different combinations. How much better did the R4 kids in group 555 do than those in group 111, or group 314, or any other combination?

Sounds fair, right?

Well, the first thing that should cause some concern is the initial rating of the teachers. Their effectiveness is not an objective thing as would be, for instance, number of years teaching. So the teachers who are effective are the ones who are known to get the good test scores.

What this means is that the study is at risk of simply proving that ‘effective teachers are effective’ and ‘ineffective teachers are ineffective,’ something that must be true by common sense. But they are also studying the effects of having several effective or ineffective teachers in a row. Maybe the results magnify, maybe having three effective teachers in a row is a lot better than the benefit you get from having one effective teacher multiplied by 3?

Another potential flaw in the methodology is whether or not students are truly randomly selected to be in group 111, or 314, or 555. Perhaps there are ‘confounding’ factors that made the sampling not random. On page 6 of the Jordan report, they even admitted

Because there was some bias in student assignment to groups (which will be reported on in further research) …

This offhand admission is pretty serious as it could invalidate all the results!

In the appendices, I was able to find some of the data that they used to produce the graph. They don’t compare group 111 to 555 because those two groups had different starting points. Instead they compare 111, 314, and 535. The raw scores in the low group went from 57.2 to 33.4. The raw scores in the middle group went from 57.3 to 50.2. The raw scores in the high group went from 56.6 to 59.7. To make the graph, they translated these scores into percentiles. They did not provide the conversion table, so I wasn’t able to verify these percentiles. Since the percentiles weren’t provided, I had to use, for my examples that don’t support the hypothesis, numbers based on just raw score changes.

When you look at the data it is, at first, pretty intimidating: 125 combinations for 10 different cohorts. To test the stability of the data, I thought of a good experiment: I took the results for cohort R8 and checked the starting scores and ending scores for all six orderings of the ratings 1, 3, and 4. If their process is stable, it shouldn’t matter if kids have the three teachers in the order 314 or 134. It’s still a middle teacher, a low teacher, and a high teacher.

I found that although they all started with about 42%, three years later the scores ranged by a lot. So either this data is highly unstable or somehow the order that you encounter these teachers makes a big difference. This is something that was not mentioned in the paper, but it is very revealing, and probably the easiest way to refute the accuracy of their data.

Here are the results for the six possible orderings of 1,3,4:

R8-134 42.3% to 41.7% change of -0.7%

R8-143 42.6% to 40.8% change of -1.8%

R8-314 40.6% to 38.9% change of -1.6%

R8-341 42.8% to 35.8% change of -7.0%

R8-413 42.7% to 34.0% change of -8.8%

R8-431 45.7% to 42.0% change of -3.7%

As you can see, the results are very unstable.
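One rough way to put a number on that instability is to recompute each change and look at the gap between the smallest and largest drop. The sketch below uses only the six start and end scores listed above. The variable names and the choice of "spread" as a summary are my own assumptions, not anything from the Jordan report.

```python
# (start, end) reading scores for cohort R8, copied from the list above.
r8_scores = {
    "134": (42.3, 41.7),
    "143": (42.6, 40.8),
    "314": (40.6, 38.9),
    "341": (42.8, 35.8),
    "413": (42.7, 34.0),
    "431": (45.7, 42.0),
}

# Change for each ordering, recomputed from the rounded scores. These can
# differ by 0.1 from the changes listed above, presumably due to rounding.
changes = {
    order: round(end - start, 1)
    for order, (start, end) in r8_scores.items()
}

smallest_drop = max(changes.values())   # change closest to zero
largest_drop = min(changes.values())    # most negative change
spread = round(smallest_drop - largest_drop, 1)

for order, change in changes.items():
    print(order, change)
print("spread:", spread)
```

Six groups that all had one low, one middle, and one high teacher end up about 8 points apart. That is larger than the gap between the starting scores of any two of them.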

One easy thing I did was take group R4 and do a scatter plot on their gains. Rather than have 125 different combinations to check, I instead added the three numbers together. My assumption is that if someone had combination 314, it’s the same as having 413 or 134 or 341. I’m also making an assumption that 314, since it adds up to 8, is similar enough to 332, 431, or even 215. This made it easier to get a handle on the data since now there were only 13 groups to check (a sum of 3 at the minimum for 111 up to 15 for 555).
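The pooling described above can be sketched in a few lines. This is only an illustration of the bookkeeping, with names like `groups_by_sum` made up for the example. It contains none of the study's score data.

```python
from collections import defaultdict
from itertools import product

# Bucket every three-teacher sequence by the sum of its ratings.
groups_by_sum = defaultdict(list)
for combo in product(range(1, 6), repeat=3):
    label = "".join(str(r) for r in combo)
    groups_by_sum[sum(combo)].append(label)

print(len(groups_by_sum))       # 13 distinct sums, 3 through 15
print(groups_by_sum[3])         # ['111']
print(groups_by_sum[15])        # ['555']
print(len(groups_by_sum[8]))    # how many sequences get pooled with 314
```

The extreme sums hold a single sequence each, while the middle sums pool many different sequences together, so this simplification throws away all information about order.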

When I made a scatter plot for group R4 based on this, it looked like this:

The first thing to notice is that the one dot for 15, the 555 group, actually had only a 2 point gain. The 111 group did not have the lowest score and the 555 group did not have the highest. Also, the 14s did not do that well, considering they had two 5s and a 4. The 12s, who averaged 4 per teacher, had as many classes with negative gains as positive ones. The 9s (including the 3/3/3 group) had all negative gains, though that could still be an increase in percentile; they did not provide the conversions except for their 20 graphs.

Another thing to notice is that there is what I’d call a ‘weak’ linear correlation — weaker than I would have predicted, actually, considering the circular nature of the methodology where they are proving that effective teachers are, indeed, effective.

Another inconsistency I noticed is that in the 10 cohort groups, the 555 group had one of the top two gains only in M5. The 111 group was in the bottom two twice.

Now, to me, the weak linear correlation is much less than I would have predicted, but you might think that the fact that there is any correlation at all proves, in some way, the point.

So, if you think that the study is valid, I have another argument: So what?

If having three great teachers in a row is so much better than having three horrible ones, how does that guide us in what to do? Sanders says the data is useful since it can help us give ineffective teachers support so they can become more effective. He also thinks we can use this data to prevent a student from getting a bad combination of teachers. There’s no mention of firing those teachers. The ironic thing is that these studies happened fifteen years ago, and Tennessee and Dallas have had the chance to use their value-added studies to improve education in that state and city. And where has it gotten them? Neither is known as a leader or model for successful ed reform.

Now Rhee and the corporate reformers use these studies to justify making it easier to fire teachers. But does she take the authors’ advice and have any plan to help ineffective teachers improve? Rhee and others using these (flawed, in my opinion) research papers to justify their agenda reminds me, a bit, of Nazis (I know this is a low blow) using Nietzsche’s philosophy books to justify their agenda.

[Note: I have since apologized to Michelle Rhee for using an analogy like this. Name-calling is not a mature thing to do, and I shouldn't compare someone who I have no doubt believes that her work will help students with a group of people who murdered millions of people. I will not use such analogies in the future.]

Now, you might ask: as a highly effective teacher myself, don’t I like to think that I make more of a difference than an ineffective co-worker? Of course I do. But I don’t teach to the test, so it is unlikely that my students’ test scores would be remarkably higher than my peers’. I actually did have the opportunity when I taught in Houston to have many of the same students for three consecutive years. That was almost 20 years ago. I don’t think their standardized test scores went up by that much, but I do know that the number of times they went home and told their family about the math riddle of the day went up. I also feel like when their children are going to school, they will tell those children about how fun math is, all of it something I contributed to in an immeasurable way.

Yes, there are good teachers and there are not-so-good teachers.

If all teachers were a little better, that would be good for kids. I think that the number of math teachers who don’t really know or love math is a problem — but we just don’t have enough who do know and love math who are willing to do it.

And I remember having some duds when I was in school. I don’t think it hurt me much. A teacher who was bad for me might have been good for someone else. Who am I to say? And I don’t think I would have even wanted to have all ‘highly effective’ teachers. Imagine going through 8 periods a day in high school with a ‘Dead Poets Society’ Robin Williams teacher? I think I’d have a heart attack. I liked my mix. Some good ones, some OK ones, a few pretty bad ones.

But I digress … The main point here is that there were some value added studies which have been adopted by the reformers. When the Jordan study came out in 1997, it wasn’t such a threat to teachers and the U.S. education system, so I don’t know if many people took the time to examine it closely. Now, that study has taken on the aura of legend. We don’t question it. It just is. But since so much seems to be riding on the 3 effective teachers in a row vs. 3 ineffective teachers in a row conclusion, I hope that I’ve pulled this study out of its vault and raised at least some reasonable doubt about its validity, especially now that it’s become a weapon in the wrong hands.

9 Responses

A splendid combination of analysis and putting-into-perspective. I am an educational researcher myself and promised myself to distrust statistics even more than I have to professionally trust it. Your article is a great reminder of that promise.

Besides, no matter how many teachers Rhee fires based on this and other self-deceiving studies, there will always remain a ‘bottom 20%’. And another ‘bottom 20%’. And a new ‘bottom 20%’.

It is high time to open our eyes, and do what is really necessary to make the teaching job attractive enough for the best & brightest students, who would gladly spend a lifetime teaching our kids & enjoying it because of the wonderful job that it is. Building a great teaching corps is all about attracting the right people, not about scaring the average Joes away.

In order to attract the right people, the job must offer respect, stability, autonomy, and proper work conditions that allow teachers to invest & offer quality. They perfectly understood this in Finland, and with splendid results. Why can’t Rhee and the likes of her get this into their heads?

What people have to understand is that the TRUE motivation behind all these toxic “reforms” (value-added analysis, merit pay, destruction of seniority, dismissing the value of low class size, crushing teacher unions, de-professionalization of teaching, etc.) has nothing to do with improving the education of children. It’s about reducing the costs of public education by firing the highest-paid teachers regardless of quality, and in most cases of veteran teachers, in spite of their high quality.

You need to go back to where it starts. Corporations want to have higher profit margins so that the price of their stockholders’ shares will go up. It’s the same with wealthy people who want to pay less taxes to the government, and spend the money on themselves.

Both groups say, “Gee, how can we pay less taxes? Well… let’s see… the biggest line item in the government budget is EDUCATION, and the biggest line item within the education budget is TEACHERS’ SALARIES, and the largest amount of THAT money goes to VETERAN TEACHERS because of the union-negotiated salary scale.

Well, if we could just be able to fire those veteran teachers “at will” (or make them as close to being fired at will as possible), think of how much less taxes we’d pay, how much more profitable our companies would be, how much the price of our stock shares would go up, how much money we wealthy people could spend on ourselves?”

Keep in mind that the wealthiest folks and those sitting atop those businesses do not send their kids to public schools, so if they gut the funding and thus, trash those schools, it’s no skin off their nose.

If the public knew that’s what was going on, they would be in an uproar. That’s where well-paid demagogues like Michelle Rhee come in. That’s why you have Bill Gates/Eli Broad/the Walton family buying front people like Michelle, and film makers like Davis Guggenheim to push this evil agenda.

These front people go around pretending that they’re pushing this because they actually care about improving public education. They tell everyone that that is the reason they’re pushing value-added analysis, merit pay, destruction of seniority, crushing teacher unions, dismissing the value of low class size, the devaluing / elimination of university teacher education, etc.

Pursuant to that end, they misuse or distort studies like the one being critiqued in this article. Gates & Co. also commission studies from questionable groups—also funded by them— to come up with pre-ordained conclusions that will back up their agenda.

For example, a Gates-funded educational think tank produced a study which “proved” that teachers peak at 3-5 years, and then starting Years 4-5-6, steadily decline in their teaching quality from that point on, so that by Year 10, these teachers have grown so lazy and complacent they’re now worse than when they were first-year teachers.

The obvious action plan arising from such a study: fire or “counsel out” all teachers within 5 years, and make the faculty of all schools have 5 years or less experience. It’s not because it’s cheaper, Bill Gates tells us. No, it’s because it’s better for the educational outcomes of students. The fact that it’s cheaper is just a coincidence.

That’s one of the underlying principles of Teach for Awhile… err… Teach for America.

Yeah, that’s the solution. Make teachers more like fast food workers, or retail workers. In fact, teaching shouldn’t be a profession at all.

The same goes with pushing to raise class size. After all, the “studies” show that raising the class size will not negatively impact student achievement, so that means we don’t need to hire (and conveniently, pay) as many teachers as we’ve been doing all this time.

They tell us we need to de-professionalize teaching altogether. Pursuant to that end, let’s create a bogus group like NCTQ to push for this. NCTQ then comes up with the conclusion that all university Departments of Education are worthless in preparing teachers, and all you need is five weeks of training before entering the classroom. I mean, why pay teachers more for attaining Bachelor’s, or Master’s, or state credentials, or passing state boards, when the “studies” show that having or passing these doesn’t make you a more effective teacher? They just cost taxpayers more for something that’s not really effective to begin with.

Indeed, NCTQ “studies” show that these degrees, certifications, state boards, Subject knowledge/expertise all make you a worse teacher, what with all that high-falutin’ “Child Development” and “Classroom Management” nonsense they cram into your head, and all that in-class mentoring/Student Teaching that just confuses you, when, in fact, you’d be a much better teacher without all that.

Mind you, the NCTQ is made up of people who’ve either never taught a day in their lives, or barely taught a year or two, and when they critique the Departments of Ed, they do so without ever visiting them, or sitting in on their classes, or interviewing their students, or interviewing the principals of the schools where these students teach. No, they sit in a room in another state, and then do on-line reviews where they only need to read the syllabi of the courses, and that’s their basis for concluding that Departments of Education should be abolished.

This is hardly a linear situation. Consider that bright children typically score higher on tests. Give a bright child a poor teacher, she will still quite likely learn the material. Give a less intelligent student the same teacher, she may be lost.

Excellent study!! Thanks so much, Gary.
I hadn’t heard of your blog until today, but you and I seem to be on parallel tracks in terms of debunking the educational Deformers like Rhee and company.
You might want to look at my blog as well.
Gfbrandenburg.wordpress.com

Our organization recently looked at the Dallas data and found that students who had three teachers rated 5 had a median increase of 6.916 while those who had two teachers rated 5 and one rated 4 had a median increase of 7.475. These exciting results show that rather than concentrating on putting an excellent teacher in every classroom, we should be striving to put an excellent teacher in only 2/3 of the classrooms.