Outcomes—the Holy Grail of Teacher Prep Evaluation?

Since NCTQ’s Review of teacher prep programs was released, there’s been a lot of much-needed conversation about teacher preparation in this country. Teacher prep has long been the forgotten segment of the teacher pipeline, but the past week has brought into focus its true place as the cornerstone of the profession.

With this great responsibility—to aspiring teachers, their future employers, their future students, their future students’ parents—inevitably comes accountability.

When we set out on the path that led us to the Review, we intended to contribute one measure of the many that may be needed for robust teacher prep accountability. Given the size of the teacher prep system and the nature of the data available, we felt confident in looking at the structure of teacher preparation programs: the courses they require, the readings candidates have to do, the number and kinds of observations candidates receive during student teaching, and so forth. It’s this perspective that dredges up findings like these: 70 percent of elementary programs don’t require a single basic science course of their candidates (either through their ed coursework or through their general university requirements), one in four secondary programs doesn’t require a full subject-specific methods course, and one in five of the programs whose use of outcomes data we were able to study doesn’t survey its grads.

As has been frequently pointed out in the past week, these are inputs.

We at NCTQ draw pushback for advocating that teachers’ outcomes data (VAM being a common example) be considered in teacher evaluations, compensation and promotions, so it’s almost refreshing to find ourselves seemingly on the other side of things when it comes to measuring the impact of teacher preparation.

The thing is, we’d love to look at outcomes data. Our Evidence of Effectiveness Standard was designed to account for states’ outcomes data. Unfortunately, the current generation of data is rife with structural mismatches, and ultimately we could apply this standard to only one school out of 1,130 in the Review!

Yesterday, Paul E. Peterson took a stab at comparing our inputs-focused findings with Florida’s outcomes data. We hope more folks take on this type of reflection, and so we’d like to point out a few of the tripwires Peterson came across.

First, he found how little the samples overlap. He concluded his analysis with this warning: “our data set overlaps that of NCTQ for only a few institutions. Generalizing from these few cases is risky.”

In fact, his dataset overlaps even less than we all hoped.

Here’s one example of a common mismatch: Mismatched programs

Peterson’s 2011 study cited Florida’s VAM findings for St. Petersburg College, suggesting that the college’s graduates are more effective in teaching reading and math (although their scores are not actually different enough to be statistically significant). He contrasts these VAM findings with the low program rating the school received in the Review for its reading and math instruction. However, the data in the 2011 study was based on St. Petersburg College’s undergraduate elementary program, whereas the Review evaluated the graduate elementary program (technically, a post-baccalaureate program).

We hope that someday the instruction provided to aspiring teachers at an IHE, regardless of whether they’re in an undergrad or grad program, would reflect a systemic vision for what it means to be a prepared teacher. If this were the case, undergrad VAM data might be a reasonable proxy for grad performance. For the moment, however, graduate and undergraduate programs on the same campus too often have too little in common. Only one in five institutions where we were able to review both the undergraduate and graduate programs received the same scores in reading and math instruction, for example.

Here’s another example of a common mismatch: Mismatched years

Schools of education frequently change their course offerings and requirements (the widespread adoption of TPAs in the last couple of years is one example). While the vast majority of the data behind the analysis in the Review is from the last couple of years, student performance data often lags. The student performance data used in Peterson’s study is from 2002-2009, for teachers who graduated as far back as 1995. Given the changes programs undergo, it would be a stretch to compare inputs from 2012 with outcomes from training completed a decade or even two decades earlier.

On the outcomes front, most institutions’ data systems, most states’ data systems and the first edition of the Review leave much to be desired. But the consensus we’ve seen in this past week’s dialogue around teacher prep, that outcomes data is a much-needed measure, is absolutely heartening.