September 9, 2008

Cold permutations

Semi-success on the reserving-time front. I had a lunch meeting and then a 3 pm meeting, and the time in between was too short to do much, so I exchanged one parking sticker for another. Whee. At least my wonderful grad student assisting with the journal did a monster job helping on a long MS, giving my head-cold-affected mind a much easier job going through the next article. I WILL climb on top of this mountain of work. Just not today.

It's a semi-full-blown cold now. Proof: I should be asleep, and I'm exhausted, but I can't sleep.

I've been trying to wrap my mind around permutation tests and exchangeability for about a week, and I figure that my typical head-cold mentality may be my best shot at it, both because of the orthogonal way I think way too late on a head-cold evening and because, once I'm up this late and in this state, no student or MS author wants me to be making decisions right now. (For the record, I'm on antihistamines. I know, I know: Never take Benadryl and grade. No. That's not funny, not even in my state of mind.)

A few weeks ago, I was pondering the NYC achievement gap controversy, a debate over the summer that among other things spawned a Teachers College Record commentary by Jennifer Jennings and me (available just to subscribers for now, but to the world in a few weeks). And while the limits on TCR commentaries and op-eds require a fairly narrow argument, I kept thinking about trends and time-series data as I looked at the New York City Department of Education's claims. I kept thinking to myself, There has to be something an historian can contribute to this debate that is specific to the way historians think. I'll probably write something at length when I'm more coherent and have some time, but an obvious answer came to mind: to historians, the order of events matters. An argument about causality depends on contingency, which depends on a sequence. (Historians often focus on contingency rather than causality, except when we're playing the counterfactual game. The obvious answer to the question, "What caused Gore's defeat in 2000?" is "everything, or almost everything.") The sequence doesn't prove causality (or contingency), but it's necessary.

That logic is usually not applied in policy. In the case of New York City, as is typical in this type of reform publicity, someone pointed to a time series of data and claimed, "Aha! See this trend? Ignore its tentative nature: it's PROOF that we're on the right track." One obvious problem with the NYC data is the reliance on threshold-passing percentages; that's the focus of the TCR commentary. But the NYC Department of Education made claims about the achievement gap more broadly, and the data is a lot messier than the folks in Tweed would admit. Below are three orderings of the "z-scores" of achievement gaps (the differences in Black-White means on the 4th-grade state math tests, scaled to the population's standard deviation). One is the real time series that runs from 2002 to 2008. The other two are permutations. Before you look for the data (it's on p. 13 of the PDF file linked above), see if you can tell the differences among them, and which is the observed order:

0.74  0.79  0.73  0.67  0.72  0.67  0.71

0.79  0.67  0.72  0.67  0.71  0.74  0.73

0.79  0.72  0.71  0.74  0.73  0.67  0.67

My professional judgment as an historian is also common sense: if the order of events does not make a discernible difference, even if you ignore measurement error and standard errors, then it's hard to conclude that there's a trend. How to test that is the realm of statistics, and when I explained the issue to my colleagues Jeffrey Kromrey and John Ferron, the answer from them was clear: permutation tests. That's a general family of nonparametric inference tests that's the formal version of the question I asked: if you jumble up the data in all the possible ways they could be permuted, and you look at a particular measure of interest (a test statistic), where in the distribution of all permutations does the observed data set fall? In the case of the 4th-grade Black-White gap on New York state math tests measured as a z-score, we have 7 data points, which have 7! = 5040 permutations. If you choose an appropriate test statistic, compute it for each permutation, and find that the observed time series falls within about 126 of either end of the distribution (2.5% of 5040), that excludes the 95% of permutations in the middle.
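For what it's worth, the brute-force version of that question is short enough to sketch. Here's a minimal illustration in Python (rather than R), and note that my use of the least-squares slope as the test statistic is an assumption on my part; choosing the statistic is exactly the "art" part my colleagues warned about:

```python
import itertools

# The seven gap z-scores (Black-White difference in 4th-grade math
# means, scaled by the standard deviation); the order here is arbitrary.
gaps = [0.74, 0.79, 0.73, 0.67, 0.72, 0.67, 0.71]

def slope(ys):
    """Least-squares slope of ys against time indices 0..n-1."""
    n = len(ys)
    x_mean = (n - 1) / 2
    y_mean = sum(ys) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in enumerate(ys))
    den = sum((x - x_mean) ** 2 for x in range(n))
    return num / den

observed = slope(gaps)

# Enumerate all 7! = 5040 orderings and ask where the observed
# slope falls among the slopes of every permutation.
perm_slopes = [slope(p) for p in itertools.permutations(gaps)]

# Two-tailed: the share of permutations at least as extreme as observed.
extreme = sum(1 for s in perm_slopes if abs(s) >= abs(observed))
print(len(perm_slopes))              # 5040
print(extreme / len(perm_slopes))
```

If that final fraction is 0.05 or less, the observed ordering sits in the extreme tails of the permutation distribution; if it's large, the "trend" is the kind of pattern shuffling alone produces all the time.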

No, I haven't had the time or inclination to follow up, learn how to calculate one of the possible test statistics, and figure out how to get the R statistics program to do a permutation test. There are two problems, as I've learned from my colleagues: choosing the right test statistic is a matter of art as well as science, and there may be a problem with exchangeability. As far as I understand it, exchangeability is a less constricting assumption than the standard "independent, identically distributed" sample assumption in parametric inferential statistics. From what I understand, the practical definition of exchangeability means roughly that you could theoretically swap any of the data points without screwing up the distribution. Again, if I understand correctly, one situation that violates the assumption of exchangeability is autocorrelated data, i.e., when one data point influences the next one (or the next few). And if there's anything that's likely to be autocorrelated, it's a time series. That's not a serious problem if you're just looking to see if a trend exists at all; for that purpose, autocorrelation is a form of trend (though an artifactual one). But if you're trying to make causal inferences or anything more complicated when there's autocorrelation (e.g., whether achievement levels or trend slopes are different before and after a policy change), I think you have to throw permutation tests out the window.
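To make the autocorrelation worry concrete, here's a rough sketch of a lag-1 autocorrelation check; this is my own illustration, not a formal diagnostic, but a value far from zero is a warning sign that exchangeability is in trouble:

```python
def lag1_autocorr(ys):
    """Lag-1 autocorrelation: how strongly each point tracks its predecessor."""
    n = len(ys)
    mean = sum(ys) / n
    var = sum((y - mean) ** 2 for y in ys)
    cov = sum((ys[i] - mean) * (ys[i + 1] - mean) for i in range(n - 1))
    return cov / var

# A strongly trending series is highly autocorrelated...
print(lag1_autocorr([1, 2, 3, 4, 5, 6, 7]))   # ~0.57 for this short series

# ...while the gap z-scores come out much closer to zero.
print(lag1_autocorr([0.74, 0.79, 0.73, 0.67, 0.72, 0.67, 0.71]))
```

With only seven points, of course, any such estimate is extremely noisy; the point is the logic, not the number.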

And that's such a shame, because the concept is still right when extended beyond the question of a trend: if a policy makes a difference, then it should matter which side of the policy change you're sitting on. So if you're a clever person with statistics, please provide some ideas in comments for where to go with this, or whether, as I suspect, the best we can do with permutation tests is to rule out possible trends/autocorrelation.