Test Score Gains Are Not Necessarily a Sign of Better Instruction: A Cautionary Tale From Newark

In this series, I've been breaking down recent research about Newark, NJ's schools. Reformy types have been attempting to make the case that "reforms" in Newark over the past several years -- including charter school expansion, merit pay, Common Core alignment, school closures, and universal enrollment -- have led to gains in student learning. These "reforms" are purportedly the result of Facebook CEO Mark Zuckerberg's high-profile, $100 million grant to the city's schools back in 2010.

Bruce Baker and I looked carefully at this study, and added our own analysis of statewide data, to produce a review of this research. One of our most important findings is that most of the "gains" -- which are, in our opinion, educationally small anyway (more on this later) -- can be tied to a switch New Jersey made in 2015 from the NJASK statewide exams to the newer PARCC exams.

As I noted in the last post, even the CEPR researchers suggest this is the most likely explanation for the "gains."

Assuming both tests have similar levels of measurement error, this implies that the PARCC and NJASK were assessing different sets of skills and the districts that excelled in preparing students for PARCC were not necessarily the same as the districts that excelled at preparing students for NJASK. Thus, what appears to be a single-year gain in performance may have been present before 2015, but was simply undetected by earlier NJASK tests. (p. 22, NBER, emphasis mine)

As I pointed out last time, there has never been, to my knowledge, any analysis of whether the PARCC does a better job measuring things we care about compared to the NJASK. So, while the PARCC has plenty of supporters, we really don't know if it's any better than the old test at detecting "good" instructional practices, assuming we can hold things like student characteristics constant.

But even if we did have reason to believe the PARCC was a "better" test, I still would find the sentence above that I bolded to be highly problematic. Let's look again at the change in "value-added" that the CEPR researchers found (p. 35 of the NBER report, with my annotations):

"Value-added" -- ostensibly, the measure of how much the Newark schools contributed to student achievement gains -- was trending downward prior to 2014 in English language arts. It then trended upward after the change to the new test in 2015. But the CEPR authors say that the previous years may have actually been a time when Newark students would have been doing better, if they had been taking the PARCC instead of the NJASK.

The first problem with this line of thinking is that there's no way to prove it's true. But the more serious problem is that the researchers assume, on the basis of nothing, that the bump upwards in value-added represents real gains, as opposed to variations in test scores which have nothing to do with student learning.
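To see how this can happen, here is a toy simulation, entirely invented for illustration and making no claims about the actual content of the NJASK or PARCC. Two tests measure different weighted mixes of two underlying skills. A district that has drilled one skill heavily will show an apparent "gain" the year the test switches to one that weights that skill more, even though nothing about student learning changed between the two years.

```python
import random

random.seed(42)

N_STUDENTS = 1000

# Toy model: each student has two latent skills, A and B. True
# learning is held constant -- nothing about instruction changes
# between the "old test" year and the "new test" year.
students = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(N_STUDENTS)]

def score(skills, w_a, w_b, noise=0.3):
    """A test score is just a weighted mix of the two skills plus noise."""
    a, b = skills
    return w_a * a + w_b * b + random.gauss(0, noise)

# Hypothetical district that has drilled skill A (test prep), so its
# students are stronger on A than B. The weights below are invented:
# the old test weights the skills evenly; the new test happens to
# weight skill A more heavily.
district = [(a + 0.5, b) for a, b in students]  # +0.5 on skill A only

old_test = [score(s, 0.5, 0.5) for s in district]
new_test = [score(s, 0.8, 0.2) for s in district]

mean_old = sum(old_test) / N_STUDENTS
mean_new = sum(new_test) / N_STUDENTS

print(f"Mean score, old test: {mean_old:.2f}")
print(f"Mean score, new test: {mean_new:.2f}")
print(f"Apparent 'gain' from switching tests: {mean_new - mean_old:.2f}")
```

The simulated district posts a higher mean the year the test changes, purely because the new test rewards the skill mix it had already emphasized. An observer comparing year-over-year scores would see a "gain" that reflects the test switch, not improved instruction.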
To further explore this, let me reprint an extended quote we used in our review from a recent book by Daniel Koretz, an expert on testing and assessment at Harvard's Graduate School of Education. The Testing Charade should be required reading for anyone opining about education policy these days. Koretz does an excellent job explaining what tests are, how they are limited in what they can do, and how they've been abused by education policy makers over the years.

I was reading Koretz's book when Bruce and I started working on our review. I thought it was important to include his perspective, especially because he explicitly takes on the writings of Paul Bambrick-Santoyo and Doug Lemov, who both just happen to hold leadership positions at Uncommon Schools, which manages North Star Academy, one of Newark's largest charter chains.

Here's Koretz:

One of the rationales given to new teachers for focusing on score gains is that high-stakes tests serve a gatekeeping function, and therefore training kids to do well on tests opens doors for them. For example, in Teaching as Leadership[i] -- a book distributed to many Teach for America trainees -- Steven Farr argues that teaching kids to be successful on a high-stakes test "allows teachers to connect big goals to pathways of opportunity in their students' future." This theme is echoed by Paul Bambrick-Santoyo in Leverage Leadership and by Doug Lemov in Teach Like a Champion, both of which are widely read by new teachers. For example, in explaining why he used scores on state assessments to identify successful teachers, Lemov argued that student success as measured by state assessments is predictive not just of [students'] success in getting into college but of their succeeding there.

Let's use Lemov's specific example to unpack this.

To start, Lemov has his facts wrong: test scores predict success in college only modestly, and they have very little predictive power after one takes high school grades into account. Decades of studies have shown this to be true of college admissions tests, and a few more recent studies have shown that scores on states' high-stakes tests don't predict any better.

However, the critical issue isn't Lemov's factual error; it's his fundamental misunderstanding of the link between better test scores and later success of any sort (other than simply taking another similar test). Whether raising test scores will improve students' later success -- in contrast to their probability of admission -- depends on how one raises scores. Raising scores by teaching well can increase students' later success. Having them memorize a couple of Pythagorean triples or the rule that b is the intercept in a linear equation[ii] will increase their scores but won't help them a whit later.

[...]

Some of today's educators, however, make a virtue of this mistake. The[y] often tell new teachers that tests, rather than standards or a curriculum, should define what they teach. For example, Lemov argued that "if it's 'on the test,' it's also probably part of the school's curriculum or perhaps your state standards... It's just possible that the (also smart) people who put it there had a good rationale for putting it there." (Probably? Perhaps? Possible? Shouldn't they look?) Bambrick-Santoyo was more direct: "Standards are meaningless until you define how to assess them." And "instead of standards defining the sort of assessments used, the assessments used define the standard that will be reached." And again: "Assessments are not the end of the teaching and learning process; they're the starting point."

They are advising new teachers to put the cart before the horse.[iii] [emphasis mine; the notes below are from our review]

Let's put this into the Newark context:

One of the most prominent "reforms" in Newark has been the closing of local public district schools while moving more students into charter schools like North Star.

By their own admission, these schools focus heavily on raising test scores.

The district also claims it has focused on aligning its curriculum with the PARCC (as I point out in our review, however, there is little evidence presented to back up the claim).

None of these "reforms," however, are necessarily indicators of improved instruction.

How did Newark get its small gains in value-added, most of which were concentrated in the year the state changed its tests? The question answers itself: the students were taught with the goal of improving their test scores on the PARCC. But those test score gains are not necessarily indicative of better instruction.

As Koretz notes in other sections of his book, "teaching to the test" can take various forms. One of those is curricular narrowing: focusing on tested subjects at the expense of instruction in other domains of learning that aren't tested. Did this happen in Newark?

More to come...

[i] Farr, S. (2010). Teaching as leadership: The highly effective teacher's guide to closing the achievement gap. San Francisco: Jossey-Bass. We note here that Russakoff reports that Teach for America received $1 million of the Zuckerberg donation "to train teachers for positions in Newark district and charter schools." (Russakoff, D. (2016). The prize: Who's in charge of America's schools? New York, NY: Houghton Mifflin Harcourt. p. 224)

[ii] A "Pythagorean triple" is a set of three whole numbers that satisfies the Pythagorean theorem regarding the sides of a right triangle. Koretz critiques the linear intercept rule, noting that b is often taught as the intercept of an equation in high school, but is usually a coefficient of an equation in college courses. In both cases, Koretz contends test prep strategies keep students from gaining a full understanding of the concepts being taught. See: Koretz, D. (2017). The testing charade: Pretending to make schools better. Chicago, IL: University of Chicago Press. pp. 104-108.

[iii] Koretz, D. (2017). The testing charade: Pretending to make schools better. Chicago, IL: University of Chicago Press. pp. 114-115.