The paper “The Problem With ‘Proficiency’: Limitations of Statistics and Policy Under No Child Left Behind” by Andrew Dean Ho, published in Educational Researcher, Vol. 37, No. 6, pp. 351–360 (August/September 2008), makes the salient point that the interpretation and use of cut scores, and in particular of the Percentage of Proficient Students (PPS), give inherently misleading accounts of progress and of comparisons among groups.

The problem is not caused by intentional bias on the part of psychometricians or those determining the cut scores, though Ho does imply that those who understand the problem could use it, and perhaps have used it, to produce biased results.

In the following I summarize the problem he describes. Later I will reference a Mathematica CDF file, which I will design and make available, that dynamically simulates the effects of cut score placement and of changes in the test score distribution to illustrate the principles discussed.

“The Percentage of Proficient Students (PPS) is a conceptually simple score-reporting metric that became widely used under the National Assessment of Educational Progress (NAEP) in the 1990s (Rothstein, Jacobsen, & Wilder, 2006). Since 2001, PPS has been the primary metric for school accountability decisions under the No Child Left Behind (NCLB) Act. In this article, through a hierarchical argument, I demonstrate that the idea of proficiency—although benign as it represents a goal—encourages higher order interpretations about the progress of students and schools that are limiting and often inaccurate. I show that over-reliance on proficiency as a reporting metric leads to statistics and policy responses that are overly sensitive to students near the proficiency cut score….” Ho, p. 351.

Assume test scores follow a bell curve and are measured in standard deviations (stdev) from the year 1 mean, with Cut1 at -1.5 and Cut2 at 0.5. Then imagine two different scenarios. In the first scenario, the cut score of interest for the 4th grade math WKCE sits at Cut1 in year 1, and in year 2 the 4th grade class scores 0.5 stdev higher. Visually, the cut score holds its position on the horizontal axis while the whole bell curve shifts to the right by 0.5 stdev, so Cut1 now falls at the -2.0 stdev point relative to the new mean. A comparison between year 1 and year 2 would show a 4.4 percentage point improvement, because that share of students moves across the cut into the next category. Note also that no improvement is recognized for the remaining 95% or so of the students, yet all improved by the same amount.

In the second scenario, the cut score of interest sits instead at Cut2 in year 1, and in year 2 Cut2 falls at the 0.0 stdev point relative to the new mean because of the same 0.5 stdev improvement. In this case the year 1 to year 2 comparison would show a 19.1 percentage point improvement, with about 81% of the students showing no recognized improvement.
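The figures in both scenarios can be checked with a few lines of Python. This is only a sketch, and it assumes the scores follow a normal distribution with a standard deviation of 1 and a year 1 mean of 0; the function name `gain_above_cut` is mine and does not come from Ho's paper.

```python
from statistics import NormalDist

def gain_above_cut(cut, shift, sd=1.0):
    """Percentage-point change in the share of students above `cut`
    when the mean of a normal score distribution moves from 0 to `shift`.
    Normality and sd=1 are assumptions made for illustration."""
    before = 1.0 - NormalDist(0.0, sd).cdf(cut)
    after = 1.0 - NormalDist(shift, sd).cdf(cut)
    return 100.0 * (after - before)

# Same 0.5 stdev improvement for every student, two different cut scores.
scenario_1 = gain_above_cut(cut=-1.5, shift=0.5)
scenario_2 = gain_above_cut(cut=0.5, shift=0.5)

print(f"Cut1 at -1.5: {scenario_1:.1f} percentage points")
print(f"Cut2 at  0.5: {scenario_2:.1f} percentage points")
```

Under these assumptions the two results round to 4.4 and 19.1, matching the numbers quoted in the scenarios.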

These scenarios show that for an identical improvement (every student improved by 0.5 stdev in both scenarios), the perceived improvement looks mediocre in scenario 1 and spectacular in scenario 2. Neither calculation is wrong, but the results are grossly misleading. If one bases teacher evaluations or AYP decisions on either scenario, the resulting consequences will be wrong in both.

The key point to take away from these scenarios is this: for any cut score like Cut2 that sits to the right of the bell curve’s peak (the mean, median and mode), as the bell curve shifts steadily to the right, year-over-year comparisons will show larger and larger gains. It will seem that the schooling is getting better by the year and that the schools and teachers have found the magic solution. This will be a false conclusion. Once the bell curve has moved far enough that the cut score lies to the left of the peak, the yearly gains shrink: slowly at first while the cut is still near the peak, fastest when the cut is about one stdev below the mean, and more slowly again as the cut moves into the thin left tail. Any conclusion that the teachers are losing their edge, that the curriculum needs to be changed, or that some heads must roll would be wrong. All such effects are an artifact of the cut score’s interaction with a bell curve, and nothing more.
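The rise and fall of the apparent gains can be seen by holding one cut score fixed and moving the bell curve past it in equal steps. The sketch below assumes a normal distribution with standard deviation 1, a cut fixed at 0.5, a starting mean of -1.5, and a constant improvement of 0.5 stdev per year; these values and the name `yearly_gains` are illustrative assumptions, not WKCE data.

```python
from statistics import NormalDist

def yearly_gains(cut, start_mean, step, years, sd=1.0):
    """Year-over-year change, in percentage points, of the share of
    students above a fixed `cut` while the mean rises by `step` each year.
    A normal distribution is assumed for illustration."""
    shares = [
        1.0 - NormalDist(start_mean + step * year, sd).cdf(cut)
        for year in range(years + 1)
    ]
    return [100.0 * (later - earlier) for earlier, later in zip(shares, shares[1:])]

# The true improvement is identical every year: 0.5 stdev.
gains = yearly_gains(cut=0.5, start_mean=-1.5, step=0.5, years=8)

for year, gain in enumerate(gains, start=1):
    print(f"year {year} -> {year + 1}: {gain:5.1f} percentage points")
```

With these inputs the reported gains climb from roughly 4.4 to 19.1 percentage points while the cut is right of the peak, then fall back symmetrically once the peak has passed the cut, even though every cohort improved by the same amount.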

The above logic applies to every cut score and every demographic subpopulation. The apparent rates of improvement will differ among the minimal, basic, proficient and advanced categories, among ethnic groups, and among schools and school districts, even when the underlying improvement is the same. Without more information about the underlying distributions, interpretations are guaranteed to be wrong.

There is a rule here that must be observed: no statistic can be understood unless and until it is related back to the original data. That is, in order to make sense of any statistic, and in particular of test outcomes, it is necessary to have either the full distributional information (the original scores) or the basic distribution statistics, such as counts, mean, median, variance, skewness and kurtosis for each category and subgroup, that would allow each of the distributions to be simulated.
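As a rough illustration of simulating a distribution from its summary statistics, the sketch below draws scores from only a count, a mean and a standard deviation, then recomputes the share at or above any cut. It assumes normality, so it ignores skewness and kurtosis, and the subgroup values (10,000 students, mean 0, standard deviation 1) are hypothetical, as are the function names.

```python
from statistics import NormalDist, fmean

def simulate_scores(count, mean, sd, seed=0):
    """Draw `count` scores from a normal distribution with the given
    mean and sd. Normality is an assumption; real score distributions
    may be skewed."""
    return NormalDist(mean, sd).samples(count, seed=seed)

def percent_at_or_above(scores, cut):
    """Percentage of scores at or above `cut`."""
    return 100.0 * sum(score >= cut for score in scores) / len(scores)

# Hypothetical subgroup: 10,000 students, scores in stdev units.
year1 = simulate_scores(count=10_000, mean=0.0, sd=1.0, seed=1)
year2 = [score + 0.5 for score in year1]  # every student improves by 0.5 stdev

mean_change = fmean(year2) - fmean(year1)
gain_low = percent_at_or_above(year2, -1.5) - percent_at_or_above(year1, -1.5)
gain_high = percent_at_or_above(year2, 0.5) - percent_at_or_above(year1, 0.5)

print(f"change in mean:      {mean_change:.2f} stdev")
print(f"gain at cut -1.5:    {gain_low:.1f} percentage points")
print(f"gain at cut  0.5:    {gain_high:.1f} percentage points")
```

The change in the mean reports the uniform 0.5 stdev improvement directly, while the gain in the percentage above a cut depends on where the cut happens to sit.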

Attached is a preliminary document called “wkce-simulation” that begins to look at the WKCE distributions and cut scores in PDF and CDF formats.