Education Next is a journal of opinion and research about education policy.

My good friend Jay Greene is back this week with yet another assault on the Gates Foundation and its Measures of Effective Teaching (MET) project, the final results of which appeared on Tuesday. Jay accuses the foundation of failing to disclose the limited power of classroom observation scores in predicting future test score gains over and above what one would predict based on value-added scores alone. In fact, he goes so far as to imply that classroom observations are not predictive at all, rendering them useless as a source of diagnostic feedback. His arguments on these points are not compelling. (Full disclosure: the MET project's principal investigator, Tom Kane, is a senior colleague of mine at the Harvard Graduate School of Education – do with that what you will.)

First, the idea that Gates has somehow suppressed information on the predictive power of observations is downright silly. In addition to featuring relevant results in Table 1 of the accompanying technical report (which Jay obviously had no trouble finding), the second “key finding” in the summary document states that “The composite that best indicated improvement on state tests heavily weighted teachers’ prior student achievement gains based on those same tests.” In other words, if all one were trying to do is to predict gains on state tests, one would use an evaluation system that places a great deal of weight – perhaps as much as 80 percent, we learn from Figure 3 – on value added.

MET argues for a more balanced set of weights among value added, classroom observations, and feedback from student surveys on other grounds. When you move toward a more balanced set of weights you lose correlation with state tests, but you gain two things (at least initially): (1) modestly higher correlation with gains on tests designed to measure “higher order” thinking and (2) far higher reliability. To their credit, however, the MET researchers also note that one should not go too far. They explain that if you assign less than 33 percent of the weight to value added, you end up LOSING not only correlation with state test score gains, but reliability and correlation with those other tests as well.
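The reliability tradeoff MET describes can be sketched numerically. Everything below is invented for illustration – the component reliabilities, correlations, and weights are assumptions, not MET's estimates – but it shows the basic mechanism: spreading weight across components whose errors are independent can raise the reliability of the composite.

```python
import numpy as np

# Illustrative sketch (NOT the MET figures): reliability of a weighted
# composite of three evaluation components. All numbers are assumptions.
# rel[i] = assumed reliability of component i
# (order: value-added, classroom observations, student surveys)
rel = np.array([0.50, 0.65, 0.80])
# corr[i, j] = assumed correlation between observed component scores
corr = np.array([[1.0, 0.3, 0.3],
                 [0.3, 1.0, 0.5],
                 [0.3, 0.5, 1.0]])

def composite_reliability(w):
    """Reliability of sum(w_i * x_i), treating each component's
    measurement error as independent of the others. Observed variance
    comes from the assumed correlation matrix; the error variance of
    component i is (1 - rel_i) of its unit variance."""
    w = np.asarray(w, dtype=float)
    var_obs = w @ corr @ w               # total variance of the composite
    var_err = np.sum(w**2 * (1 - rel))   # summed (independent) error variance
    return 1 - var_err / var_obs

heavy_va = composite_reliability([0.8, 0.1, 0.1])   # ~80% on value added
balanced = composite_reliability([0.33, 0.33, 0.34])
print(f"heavy value-added weights: {heavy_va:.2f}")
print(f"balanced weights:          {balanced:.2f}")
```

Under these made-up inputs, the balanced composite is noticeably more reliable than the one that leans heavily on value added, even though the latter would track state test gains more closely.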

Now, most of the gains in reliability from a more balanced set of weights come from the student survey feedback. The real argument for investing in classroom observations is their potential diagnostic value. Jay dismisses this possibility on spurious grounds. He claims that “if they are not useful as predictors, they can’t be used effectively for diagnostic purposes.”

But the report shows that they are predictive of test score gains. In Table 1 of the technical report (on which Jay bases his critique), the MET team uses evaluation measures from 2009-10 to test their ability to “post-dict” teachers’ effectiveness the previous year. Columns (4) and (8) report results using observations alone; in three of the four tests (the exception is elementary grades ELA), observations are statistically significant predictors. This, however, is not their strongest evidence.

The single most important contribution of the MET study is its use of random assignment (of teachers to classrooms) to validate their overall effectiveness measure and its constituent parts. Table 10 confirms that classroom observations pass this test with flying colors. (So do value added and student surveys.) So when they directly test Jay's argument (whether observations predict student achievement gains), they find that the answer is yes, whether "post-dicting" gains on a non-experimental basis (Table 1) or predicting gains following random assignment (Table 10).

Does this prove that the information observations provide can be used to improve teacher effectiveness? Of course not. And certainly they don’t yet (and likely never will) provide the detailed guidance that Jay, in a follow-up post, faults Bill Gates for promising. But we do have an existence proof in the form of a recent paper by Stanford’s Eric Taylor and Brown’s John Tyler, which shows that veteran teachers in Cincinnati improved after undergoing an intensive observation-based evaluation program. It should also be noted that the MET results are all based on existing off-the-shelf observation protocols. At least in theory, these protocols could be refined over time to improve their predictive power and diagnostic value.

Jay’s clearly right about one thing: observations are costly. He may even be right that these costs outweigh the benefits, but we do not know enough to say, and in any case that is not what he argued. (A focus of Kane’s ongoing work is figuring out how to use technology to bring the costs down while still generating reliable information.) I also share Jay’s broader concerns about the ongoing effort to prescribe mechanistic teacher evaluation systems on a district- or state-wide basis. This effort may well turn out to be an “expensive flop.” But that is hardly the right descriptor for the MET project.

I didn’t misunderstand anything. As I correctly noted in my post, there is virtually no relationship between classroom observation scores and test score gains.

Readers should look at Table 1 and they will see that the relationship is not even statistically significant at conventional levels (p < .05) in 2 of the 4 univariate regressions. Marty only says 3 out of 4 above because he is counting the result at p < .1, which is not statistically significant at conventional levels. And in all four of those univariate regressions the point estimate shows that a one standard deviation increase in classroom observation scores is associated with a .11 standard deviation or smaller increase in test score gains. Exactly as I said, classroom observations are either unrelated or barely related to test score gains.

Even worse, if we look at the multivariate regressions in Table 1 we see that classroom observations make virtually no independent contribution to predicting test score gains. In only 1 of the 4 multivariate regressions in Table 1 is there any statistical significance to classroom observations. In that case the point estimate shows that a std dev increase in classroom observation scores is associated with a .06 standard deviation increase in test score gains. And that is the highest of the four point estimates; one of which actually shows a negative relationship between classroom observations and test score gains.
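To see what point estimates of this size mean in practice, here is a toy simulation (the data are fabricated, not MET's): when both variables are standardized, a univariate slope of about 0.11 is just the correlation between the two, implying the predictor explains roughly 1 percent of the variance in gains.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000  # hypothetical sample; much larger than any real teacher sample

# Simulate a weak relationship of the size Jay describes: observation
# scores and gains correlate at about 0.11.
obs = rng.standard_normal(n)
gains = 0.11 * obs + np.sqrt(1 - 0.11**2) * rng.standard_normal(n)

# With both variables z-scored, the univariate OLS slope IS the correlation.
z = lambda x: (x - x.mean()) / x.std()
slope = np.polyfit(z(obs), z(gains), 1)[0]
r2 = slope**2  # share of variance in gains explained by observations
print(f"standardized slope: {slope:.3f}, R^2: {r2:.3f}")
```

A slope near 0.11 corresponds to an R-squared near 0.01 – statistically detectable in a large sample, but substantively small, which is the crux of the disagreement here.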

Table 10 that Marty also references provides no information on the relationship between classroom observation scores and test score gains. It only shows that the composite index is predictive of test scores, but says nothing about the contribution (or lack thereof) that classroom observations make to that predictive power.

I’m willing to believe that classroom observations could provide useful information (as other studies have shown), but the MET project, despite spending $45 million to study it, failed to find much of anything useful from classroom observations. The classroom observations in MET tell us basically nothing about how to improve teaching practices or how best to observe classrooms, despite the fact that that is precisely what the project was designed to do.

If Marty doesn’t see this as an “expensive flop” then I’m not sure what he would consider one.

Jay, take another look at Table 10. Column 6 provides evidence of the power of the classroom observation component used on its own in predicting test score gains when teachers are randomly assigned to classrooms. If it had no predictive power, the coefficient would be zero. If it were an unbiased predictor, the coefficient would be one. They reject the former but not the latter.

The table does not tell us how much the classroom observation component improves the predictive power of the overall index over and above the other components. Based on the evidence elsewhere in the report (i.e., the regressions you point to in Table 1), the likely answer is very little. But it does not need to make an independent contribution to predicting test score gains in order to be useful in improving practice.
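The distinction between "no predictive power" (a slope of zero) and "an unbiased predictor" (a slope of one) can be illustrated with simulated data. This is a hypothetical sketch, not the MET sample: realized gains are generated as the predicted gains plus noise, so the true slope is one.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400  # hypothetical number of randomly assigned teachers

# Simulate an unbiased but noisy predictor: actual gains equal predicted
# gains plus error, so the true regression slope is 1.
predicted = rng.standard_normal(n)
actual = 1.0 * predicted + rng.standard_normal(n)

# OLS slope and its standard error (intercept plus slope)
X = np.column_stack([np.ones(n), predicted])
beta, *_ = np.linalg.lstsq(X, actual, rcond=None)
resid = actual - X @ beta
se = np.sqrt(resid @ resid / (n - 2)
             / np.sum((predicted - predicted.mean())**2))

t_vs_zero = beta[1] / se           # H0: no predictive power (slope = 0)
t_vs_one = (beta[1] - 1.0) / se    # H0: unbiased predictor (slope = 1)
print(f"slope {beta[1]:.2f}: t vs 0 = {t_vs_zero:.1f}, t vs 1 = {t_vs_one:.1f}")
```

In a setup like this, one decisively rejects the slope-of-zero hypothesis but cannot reject the slope-of-one hypothesis – the pattern described above for the observation component in Table 10.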

Marty, I like and respect you, but I think your sunny optimism is making you emphasize the glass as 1/8 full. My gloomy nature sees it as 7/8 empty.

Classroom observations are not always statistically insignificant in their relationship to test score gains in MET; they are just insignificant much of the time and very small even when significant.

We both agree on this. You just see some instances of a small relationship as encouraging that the project might ultimately yield something useful about how to improve teaching practices from classroom observations. I think the small and inconsistently significant relationship makes finding much of value for how to improve teaching practice unlikely. And confirmation of my gloomy view can be found in the fact that this final round of reports from the MET project contained nothing about improving teaching practice based on classroom observations. If classroom observations in MET could be useful, I think they would be telling us about it in great detail and with much fanfare.

The focus on regression relationships between test scores and observation scores overlooks another dimension of observation scores. They reflect other aspects of what educators might think of as ‘good teaching.’ Even if observation scores are completely uncorrelated with test score gains, parents and communities might prefer that teachers score well on them. It is evidence that ‘good teaching’ is happening by the criteria used to create the observational tool.

Or, put another way, if teachers were generating high test score gains from their students by creating a climate of abject fear in their classrooms, their observation scores should be low and that information is useful.

This is not to defend or criticize the MET results, but to point out that the MET study tried to validate observation scores by examining how well they predicted test scores, which equates teaching with test scores. Not finding much correlation beyond what pre-tests already tell us, we could dismiss observations as pointless and expensive, as Jay has argued elsewhere. But test score gains plus observation scores might give a fuller picture of ‘good teaching.’ That takes us back to the starting point of trying to define the outcome we want, but the fact that the MET study did not validate observation scores should not mean that teaching and test scores are now equivalent.

As a follow-up thought here, I want to note a curious finding from the MET study about observation tools. The ‘Gathering Feedback’ report (page 30) states that for all five classroom observation tools the team analyzed, ‘the first principal component was simply an equal-weighted average of all competencies.’

Stripped of its statistical language, this statement is saying that the tools do not tell us what makes good teachers good, which was one of the aims of the MET effort. Good teachers are good at everything related to teaching.
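That pattern is what one would expect whenever every rubric item is driven mainly by a single general factor. A toy example (simulated scores, not the MET data) shows the first principal component collapsing to near-equal weights across items:

```python
import numpy as np

rng = np.random.default_rng(2)
n_teachers, n_items = 500, 6

# Hypothetical rubric: every item reflects one general 'good teaching'
# factor plus item-specific noise, so all items are positively correlated.
general = rng.standard_normal((n_teachers, 1))
scores = general + 0.8 * rng.standard_normal((n_teachers, n_items))

# First principal component of the item correlation matrix
corr = np.corrcoef(scores, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(corr)   # eigenvalues in ascending order
pc1 = eigvecs[:, -1]                      # eigenvector of largest eigenvalue
pc1 = pc1 * np.sign(pc1.sum())            # fix the sign for readability
print(np.round(pc1, 2))                   # loadings are nearly equal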

The report goes on to say the second component is about classroom and student behavior management, which will surprise nobody. But the report notes that the first component accounted for much of the variation in observation scores, so this second one is dominated by the first.

It’s unclear why observation tools are being handled in the MET reports with kid gloves. They may help to identify overall ‘good teaching’ as I noted in my previous comment, but they are somewhat surprisingly silent on specific competencies that distinguish good teachers. And why observe a range of competencies if they all move together? Even a couple might suffice.

Although I enjoyed the spirited back-and-forth above, that debate ignores the main problems regarding the recently released MET report:

1) The Report is unusable as a guide to policy or practice.

2) Exaggerations of the significance of the MET results are being broadcast to the public using the sort of spin often used to make products appear to be better than they are.

Much of what matters most educationally is simply not on standardized tests, and greater single-year test score gains can be obtained through methods that are broadly counterproductive, while lesser short-term test score gains often result from an approach that is broadly superior.

There is simply no way to look at annual reading and math test score gains and reach trustworthy conclusions about student learning or the quality of teaching. I wish humans and education were simpler than that, but human development and learning are vastly more complex than the linear progress of a brick wall being built.

1. Isn’t it as simple as this: better feedback will improve teacher effectiveness, and more effective teachers will have a greater impact on student achievement?
2. This study can and should be used as a guidepost along the education/teacher evaluation road. I dare predict that every school district that implements the 3-legged stool from this study will have significant variations in their applications.
3. Expecting any study to give us exact steps to follow in creating policy or setting a list of best practices is an example of the lack of ‘higher order thinking’ that we all (hopefully) want to inspire in the rising generation. Think, converse, apply.