Monday, January 14, 2008

We think that TV newsman John Merrow is mistaken when, in an Education Week opinion piece (“Learning Without Loopholes”, December 4, 2007), he says it is inappropriate for states to use a “margin of error” in calculating whether schools have cleared an AYP hurdle. To the contrary, we would argue that schools don’t use this statistical technique as much as they should.

Merrow documents a number of cynical methods districts and states use for gaming the AYP system so as to avoid having their schools fall into “in need of improvement” status. One alleged method is the statistical technique familiar in reporting opinion surveys where a candidate’s lead is reported to be within the margin of error. Even though there may be a 3-point gap, statistically speaking, with a plus-or-minus 5-point margin of error, the difference between the candidates may actually be zero. In the case of a school, the same idea may be applied to AYP. Let’s say that the amount of improvement needed to meet AYP for the 4th grade population were 50 points (on the scale of the state test) over last year’s 4th grade scores. But let’s imagine that the 4th grade scores averaged only 35 points higher. In this case, the school appears to have missed the AYP goal by 15 points. However, if the margin of error were set at plus-or-minus 20 points, we would not have the confidence to conclude that there’s a difference between the goal and the measured value.
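The arithmetic in this example can be sketched in a few lines of Python. This is only an illustration of the decision rule described above, not any state's actual AYP formula; the function name `misses_goal` is our own.

```python
def misses_goal(observed_gain, goal, margin):
    """Return True only when the shortfall exceeds the margin of error."""
    shortfall = goal - observed_gain
    return shortfall > margin

# Figures from the example above (points on the state test scale).
goal = 50
observed_gain = 35
margin = 20

print(goal - observed_gain)                      # shortfall: 15
print(misses_goal(observed_gain, goal, margin))  # False
```

Because the 15-point shortfall is smaller than the 20-point margin, the rule does not flag the school.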

(Margin of Error bar graph) What is a margin of error or “confidence interval”? First of all, we assume there is a real value that we are estimating using the sample. Because we don’t have perfect knowledge, we try to make a fair estimate with some specified level of confidence. We want to know how far the average score that we got from the sample (e.g., of voters or of our 4th grade students) could possibly be from the real average. If we were, hypothetically, to go back and take lots of new samples, we assume they would be spread out around the real value. But because we have only one sample to work with, we do a statistical calculation based on the size of the sample, the nature of the variability among scores, and our desired level of confidence to establish an interval around our estimated average score. With the 80% confidence interval that we illustrated, we are saying that there’s a 4-in-5 chance that the true value we’re trying to estimate is within that interval. If we need greater confidence (for example, if we need to be sure that the real score is within the interval 95 out of 100 times), we have to make the interval wider.
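As a rough sketch of that calculation, the Python below builds an interval from the three ingredients just named: sample size, variability among scores, and the desired level of confidence. It assumes a normal approximation (a t-distribution would be more appropriate for a class this small), and the scores are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def confidence_interval(scores, confidence=0.80):
    """Interval around the sample mean, using a normal approximation."""
    n = len(scores)
    average = mean(scores)
    # Standard error: variability among scores, shrunk by sample size.
    standard_error = stdev(scores) / sqrt(n)
    # Two-sided critical value for the requested confidence level.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return average - z * standard_error, average + z * standard_error

# Hypothetical scale scores for a small 4th grade class.
scores = [412, 398, 431, 405, 420, 389, 440, 415, 402, 427]

for level in (0.80, 0.95):
    low, high = confidence_interval(scores, level)
    print(f"{level:.0%} CI: {low:.1f} to {high:.1f}")
```

Running it shows the 95% interval extending beyond the 80% interval on both sides, which is the widening described above.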

Merrow argues that, while using this statistical technique to get an estimated range is appropriate for opinion polls, where a sample of 1,000 voters from a much larger pool is used and we are figuring by how much the result may change if we had a different sample of 1,000 voters, the technique is not appropriate for a school, where we are getting a score for all the students. After all, we don’t use a margin of error in the actual election; we just count all the ballots. In other words, there is no “real” score that we are estimating. The school’s score is the real score.

We disagree. An important difference between an election and a school’s mean achievement score is that the achievement score, in the AYP context, implies a causal process: Being in need of improvement implies that the teachers, the leadership, or other conditions at the school need to be improved and that doing so will result in higher student achievement. While ultimately it is the student test scores that need to improve, the actions to be taken under NCLB pertain to the staff and other conditions at the school. If the staff is to blame for the poor conditions, we can’t blame them for a range of variations at the student level. This is where we see the uncertainty coming in.

First consider the way we calculate AYP. With the current “status model” method, we are actually comparing an old sample (last year’s 4th graders) with a new sample (this year’s 4th graders) drawn from the same neighborhood. Do we want to conclude that the building staff would perform the same with a different sample of students? Consider also that the results may have been different if the 4th graders were assigned to different teachers in the school. Moreover, with student mobility and testing differences that occur depending on the day the test is given, additional variations must be considered. But more generally, if we are predicting that “improvements” in the building staff will change the result, we are trying to characterize these teachers in general, in relation to any set of students. To be fair to those who are expected to make change happen, we want to represent fairly the variation in the result that is outside the administrator’s and teachers’ control, and not penalize them if the difference between what is observed and what is expected can be accounted for by this variation.

The statistical methods for calculating a confidence interval (CI) around such an estimate, while not trivial, are well established. The CI helps us to avoid concluding there is a difference (e.g., between the AYP goal and the school’s achievement) when it is reasonably possible that no difference exists. The same technique applies if a district research director is asked whether a professional development program made a difference. The average score for students of the teachers who took the program may be higher than the average score for students of (otherwise equivalent) teachers who didn’t. But is the difference large enough to be clearly distinct from zero? Did the size of the difference escape the margin of error? Without properly doing this statistical calculation, the district may conclude that the program had some value when the differences were actually just in the noise.
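A simplified version of the professional development comparison might look like the sketch below. It uses a normal approximation with unpooled variances and treats students as independent, which a real evaluation (with students clustered under teachers) would not do; the function names and all scores are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance

def difference_interval(program, comparison, confidence=0.95):
    """CI for the difference in mean scores between two groups."""
    diff = mean(program) - mean(comparison)
    # Standard error of the difference (unpooled variances).
    se = sqrt(variance(program) / len(program)
              + variance(comparison) / len(comparison))
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return diff - z * se, diff + z * se

def clearly_nonzero(interval):
    """True when the whole interval lies on one side of zero."""
    low, high = interval
    return low > 0 or high < 0

# Hypothetical student scores for the two groups of teachers.
program = [418, 402, 431, 409, 425, 396, 437, 414]
comparison = [410, 399, 428, 401, 419, 392, 430, 407]

interval = difference_interval(program, comparison)
print(interval, clearly_nonzero(interval))
```

With these invented numbers the program group averages higher, yet the interval still spans zero, so the difference has not escaped the margin of error.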

While the U.S. Department of Education is correct to approve the use of CIs, there is still an issue of using CIs that are far wider than justified. The width of a CI is a matter of choice and depends on the activity. Most social science research uses a 95% CI. This is the threshold for so-called “statistical significance,” and it means that the likelihood is less than 5% that a difference as large as or larger than the one observed would have occurred if the real difference (between the two candidates, between the AYP goal and the school’s achievement, or between classes taught by teachers with or without professional development) were actually zero. In scientific work, there is a concern to avoid declaring there is evidence for a difference when there is actually no difference. Should schools be more or less stringent than the world of science?

Merrow points out that many states have set their CI at a much more stringent 99%. This makes the CI so wide that the observed difference between the AYP goal and the measured scores would have to be very large before we say there is a difference. In fact, we’d expect such a difference to occur by chance alone only 1% of the time. In other words, the measured score would have to be very far below the AYP goal before we’d be willing to conclude that the difference we’re seeing isn’t due to chance. As Merrow points out, this is a good idea if the education agency considers NCLB to be unjust and punitive and wants to avoid schools being declared in need of improvement. But imagine what the “right” CI would be if NCLB gave schools additional assistance when identified as below target. It is still reasonable to include a CI in the calculation, but perhaps 80% would be more appropriate.
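To see how much the choice of confidence level matters, the sketch below compares the margin of error at 80%, 95%, and 99% for a fixed standard error. The standard error of 10 points is purely an assumption for illustration, and the 15-point shortfall is borrowed from the earlier example (a goal of 50 against an observed 35).

```python
from statistics import NormalDist

def margin_of_error(standard_error, confidence):
    """Half-width of a two-sided interval under a normal approximation."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * standard_error

standard_error = 10  # hypothetical, in scale-score points
shortfall = 15       # from the earlier example: goal 50, observed 35

for confidence in (0.80, 0.95, 0.99):
    margin = margin_of_error(standard_error, confidence)
    verdict = "missed AYP" if shortfall > margin else "within the margin"
    print(f"{confidence:.0%}: +/- {margin:.1f} points, {verdict}")
```

Under these assumptions the margins come out to roughly 12.8, 19.6, and 25.8 points, so the same 15-point shortfall counts as a miss at 80% but falls within the margin at 95% and 99%.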

The concept of a confidence interval is essential as schools move to data-driven decision making. Statistical calculations are often entirely missing from data-mining tools, and chance differences end up being treated as important. Statistical methods such as including pretest scores in the statistical equation can make calculations more precise and narrow the CI. Growth modeling, for example, allows us to use student-level (as opposed to grade-average) pretest scores to increase precision. School district decisions should be based on good measurement and a reasonable allowance for chance differences. —DN/AJ