Friday, December 22, 2006

Proficiency for All, Part I: A Brief History of the NAEP

I’ve talked before about how reaching 100% proficiency for all children in all subjects by 2014 is a mathematical impossibility, but this excellent article from the November 29th issue of Education Week does the best job I’ve seen of explaining both why it can’t happen and what that means for us all as we try to reform education in the United States. The piece I find most interesting is the history it gives of the National Assessment of Educational Progress (NAEP), which seems incredibly secretive for a test that is considered to be the “nation’s report card.” I’ve pasted that section below; the full article can be found in the follow-up link at the bottom of the post:

How did we get standards so divorced from reality, even for students in the middle of the distribution? Few Americans realize how unscientific the process for defining proficiency was—and must be. NAEP officials assembled some teachers, businesspeople, parents, and others, presented these judges with NAEP questions, and asked their opinions about whether students should get them right. No comparison with actual performance, even in the best schools, was required. Judges’ opinions were averaged to calculate how many NAEP questions proficient students should answer.

From the start, experts lambasted this process. When officials first contemplated defining proficiency, the NAEP board commissioned a 1982 study by Willard Wirtz, a former U.S. secretary of labor, and his colleague Archie Lapointe, a former executive director of NAEP. They reported that “setting levels of failure, mediocrity, or excellence in terms of NAEP percentages would be a serious mistake.” Indeed, they said, it would be “fatal” to NAEP’s credibility. Harold Howe II, a former U.S. commissioner of education responsible for NAEP’s early implementation, warned the assessment’s administrators that expecting all students to achieve proficiency “defies reality.”

In 1988, Congress ordered NAEP to determine the proficient score. Later, U.S. Sen. Edward M. Kennedy’s education aide, who wrote the bill’s language, testified that Congress’ demand was “deliberately ambiguous” because neither congressional staff members nor education experts could formulate it precisely. “There was not an enormous amount of introspection,” the aide acknowledged.

Others urged NAEP to wait. In 1991, Gregory Anrig, then the president of the Educational Testing Service, which administered NAEP, suggested delaying proficiency definitions until they could be properly established. Chester E. Finn Jr., an influential member of the NAEP governing board, responded that by delaying reports on how few students were proficient, “we may be sacrificing something else—the sense of urgency for national improvement.”

Once achievement levels were set, the government commissioned a series of evaluations. Each study denounced the process for defining proficiency, leading to calls for yet another evaluation that might generate a better answer.

The first such evaluation, conducted by three respected statisticians in 1991, concluded that “the technical difficulties are extremely serious.” To continue the process, they said, would be “ridiculous.” Their preliminary report said that NAEP’s willingness to proceed in this way reflected technical incompetence. NAEP fired the statisticians.

Congress then asked the U.S. General Accounting Office for its opinion. The GAO found NAEP’s approach “inherently flawed, both conceptually and procedurally.” “These weaknesses,” it said, “could have serious consequences.” The GAO recommended that NAEP results not be published using percentages of students who were allegedly basic, proficient, or advanced.

In response, the U.S. Department of Education commissioned yet another study, this one by the National Academy of Education. The panel concluded that procedures for defining proficiency were “subject to large biases,” and that levels by which American students had been judged deficient were “unreasonably high.” Continued use of NAEP proficiency definitions could set back the cause of education reform because it would harm the credibility of NAEP itself, the panel warned.

Finally, the Education Department asked the National Academy of Sciences to weigh in. It concluded, in 1999, that the “process for setting NAEP achievement levels is fundamentally flawed” and “achievement-level results do not appear to be reasonable.”

I still don’t quite understand the credence given to the NAEP as the be-all and end-all of assessment in America. The NAEP scores always come back low, but they are given a weight that I’m not sure they deserve. The Fordham Foundation especially has used the NAEP to show that state standards stink and schools aren’t making progress, but what if the test itself is an inaccurate measure?

There’s always going to be a disconnect, too, between the results of the NAEP and the results on the myriad statewide assessments used across the country, because they test different things. You might view this as a reason to go to a national curriculum and national standards ($), and certainly the argument can be made, but the states’ rights argument is equally strong and still resonates with many people.

‘Proficiency for All’ Is an Oxymoron

Accountability should begin with realistic goals that recognize human variability.

By Richard Rothstein, Rebecca Jacobsen, & Tamara Wilder

The No Child Left Behind Act requires all students to be proficient by 2014. This is widely understood to be unattainable because 2014 is too soon. But there is no date by which all (or nearly all) students, even middle-class students, can achieve proficiency. “Proficiency for all” is an oxymoron.

The federal education legislation does not define proficiency, but refers to the National Assessment of Educational Progress. Although the Bush administration winks and nods when states require only low-level skills, the law says proficiency must be “challenging,” a term taken from NAEP’s definition. Democrats and Republicans stress that the No Child Left Behind law’s tough standards are a world apart from the minimum competency required by 1970s-style accountability programs.

But no goal can be both challenging to and achievable by all students across the achievement distribution. Standards can either be minimal and present little challenge to typical students, or challenging and unattainable by below-average students. No standard can simultaneously do both—hence the oxymoron—but that is what the No Child Left Behind law requires.

As the Harvard University professor Daniel Koretz, an expert on educational assessment and testing, has noted, typical variation in performance between those with lower and higher achievement is not primarily racial or ethnic; it is a gap within groups, including whites. Performance ranges in Japan and Korea, whose average math and science scores surpass ours, are similar to the U.S. range. If black-white gaps were eliminated in the United States, the standard deviation of test scores here would shrink by less than 10 percent. It would still be impossible to craft standards that simultaneously challenged students at the top, middle, and bottom.

The No Child Left Behind Act’s admirable goal of closing achievement gaps can only sensibly mean that achievement distributions for disadvantaged and middle-class children should be more alike. If gaps disappeared, similar proportions of whites and blacks would be “proficient”—but similar proportions would also fall below that level. Proficiency for all, implying the elimination of variation within socioeconomic groups, is inconceivable. Closing achievement gaps, implying the elimination of variation between socioeconomic groups, is daunting but worth striving for.

Not only is it logically impossible to have “proficiency for all” at a challenging level. The law and NAEP stumble further. Their expectations of proficiency are absurd, beyond challenging, even for students in the middle of the distribution. The highest-performing countries can’t come close to meeting the No Child Left Behind Act’s standard of proficiency for all. “First in the world,” a widely ridiculed U.S. goal from the 1990s that was supplanted by this federal legislation, is modest compared with the demand that all students be proficient.

On a 1991 international math exam, Taiwan scored highest. But if Taiwanese students had taken the NAEP math exam, 60 percent would have scored below proficient, and 22 percent below basic. On the 2003 Trends in International Mathematics and Science Study, 25 percent of students in top-scoring Singapore were below NAEP proficiency in math, and 49 percent were below proficiency in science.

On a 2001 international reading test, Sweden was tops, but two-thirds of Swedish students were not proficient in reading, as NAEP defines it.

How did we get standards so divorced from reality, even for students in the middle of the distribution? Few Americans realize how unscientific the process for defining proficiency was—and must be. NAEP officials assembled some teachers, businesspeople, parents, and others, presented these judges with NAEP questions, and asked their opinions about whether students should get them right. No comparison with actual performance, even in the best schools, was required. Judges’ opinions were averaged to calculate how many NAEP questions proficient students should answer.

From the start, experts lambasted this process. When officials first contemplated defining proficiency, the NAEP board commissioned a 1982 study by Willard Wirtz, a former U.S. secretary of labor, and his colleague Archie Lapointe, a former executive director of NAEP. They reported that “setting levels of failure, mediocrity, or excellence in terms of NAEP percentages would be a serious mistake.” Indeed, they said, it would be “fatal” to NAEP’s credibility. Harold Howe II, a former U.S. commissioner of education responsible for NAEP’s early implementation, warned the assessment’s administrators that expecting all students to achieve proficiency “defies reality.”

In 1988, Congress ordered NAEP to determine the proficient score. Later, U.S. Sen. Edward M. Kennedy’s education aide, who wrote the bill’s language, testified that Congress’ demand was “deliberately ambiguous” because neither congressional staff members nor education experts could formulate it precisely. “There was not an enormous amount of introspection,” the aide acknowledged.

Others urged NAEP to wait. In 1991, Gregory Anrig, then the president of the Educational Testing Service, which administered NAEP, suggested delaying proficiency definitions until they could be properly established. Chester E. Finn Jr., an influential member of the NAEP governing board, responded that by delaying reports on how few students were proficient, “we may be sacrificing something else—the sense of urgency for national improvement.”

Once achievement levels were set, the government commissioned a series of evaluations. Each study denounced the process for defining proficiency, leading to calls for yet another evaluation that might generate a better answer.

The first such evaluation, conducted by three respected statisticians in 1991, concluded that “the technical difficulties are extremely serious.” To continue the process, they said, would be “ridiculous.” Their preliminary report said that NAEP’s willingness to proceed in this way reflected technical incompetence. NAEP fired the statisticians.

Congress then asked the U.S. General Accounting Office for its opinion. The GAO found NAEP’s approach “inherently flawed, both conceptually and procedurally.” “These weaknesses,” it said, “could have serious consequences.” The GAO recommended that NAEP results not be published using percentages of students who were allegedly basic, proficient, or advanced.

In response, the U.S. Department of Education commissioned yet another study, this one by the National Academy of Education. The panel concluded that procedures for defining proficiency were “subject to large biases,” and that levels by which American students had been judged deficient were “unreasonably high.” Continued use of NAEP proficiency definitions could set back the cause of education reform because it would harm the credibility of NAEP itself, the panel warned.

Finally, the Education Department asked the National Academy of Sciences to weigh in. It concluded, in 1999, that the “process for setting NAEP achievement levels is fundamentally flawed” and “achievement-level results do not appear to be reasonable.”

All this advice has been ignored—although now, every NAEP report includes a congressionally mandated disclaimer, buried in the text: “Achievement levels are to be used on a trial basis and should be interpreted with caution.” The disclaimer adds that conclusions about changes in proficiency over time may have merit, but not about how many students are actually proficient. Yet the same reports highlight percentages of students deemed below proficient or basic, and these, not the disclaimer, are promoted in NAEP’s press releases.

A curiosity of the No Child Left Behind legislation is that while it imposes sanctions on schools where all students are not proficient, it also acknowledges that NAEP proficiency definitions should be used only on a “developmental basis,” until re-evaluated. No re-evaluation has been performed.

Although the legislation implies that proficiency is as NAEP defines it, the law permits states to set their own proficiency levels. States use their own judges to imagine how students should perform. Widely differing conclusions of judges in different states are proof enough of how fanciful the process must be. States, no matter how well-intentioned, cannot perform psychometric miracles that are beyond the reach of federal experts.

State definitions now result in many states’ reporting far higher percentages of proficient students than NAEP does. Some states define proficiency in NAEP’s below-basic range. More will do so if the No Child Left Behind law’s requirement of proficiency for all continues.

Even then, the demand for proficiency for all cannot be met because of the inevitable distribution of ability in any human population. The federal law exempts only 1 percent of all students. From what we know of normal cognitive distributions, this means that students with IQs as low as 65 must be proficient; these cognitively challenged young people must do better in math than 60 percent of students in top-scoring Taiwan. Were proficiency standards lowered to NAEP’s basic level, children with IQs as low as 65 would be expected to perform better than the 22 percent of Taiwanese students whose achievement is below NAEP’s basic score.
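The arithmetic behind the IQ figure can be checked with a short sketch. It assumes the conventional IQ scaling (mean 100, standard deviation 15), which the article does not state, and it treats the 1 percent exemption as covering the lowest-scoring 1 percent of students. The variable names are illustrative only.

```python
from statistics import NormalDist

# Assumption: IQ follows a normal distribution with mean 100 and SD 15
# (the conventional scaling; the article does not spell this out).
iq = NormalDist(mu=100, sigma=15)

# Assumption: the 1 percent exemption covers the lowest-scoring 1 percent.
exempt_share = 0.01

# Lowest IQ among students who are NOT exempt under that assumption.
cutoff_iq = iq.inv_cdf(exempt_share)

# Share of the population at or below IQ 65, for comparison.
share_below_65 = iq.cdf(65)

print(f"IQ at the 1st percentile: {cutoff_iq:.1f}")
print(f"Share of students with an IQ of 65 or lower: {share_below_65:.2%}")
```

Under these assumptions the cutoff lands close to 65, which is consistent with the figure quoted in the article.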

Discussions of reauthorizing the now almost 5-year-old law typically propose to “fix” it: by crediting gains as well as levels, extending deadlines past 2014, fiddling with minimum subgroup sizes, giving English-learners more time. None of these can save the law unless we jettison the incoherent demand that all students be proficient.

We could design accountability with realistic goals that recognize human variability. Although research and experimentation are needed to determine practical and ambitious goals, we can imagine the outlines.

We might, for example, expect students who today are at the 65th percentile of the test-score distribution to improve so that, at some future date, they perform similarly to students who are now at the 75th; students who today are at the 40th percentile to perform similarly to those who are now at the 50th; and students who are at the 15th percentile to perform similarly to those who are now at the 25th. Such goals create challenges for all students and express our intent that no child be left behind.
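To make these percentile targets concrete, here is a rough sketch of how such goals might translate into score gains. It assumes test scores are normally distributed and uses a made-up scale (mean 250, standard deviation 35) chosen purely for illustration; neither the scale nor the function name comes from the article.

```python
from statistics import NormalDist

# Hypothetical score scale, for illustration only (not NAEP's actual scale).
scores = NormalDist(mu=250, sigma=35)

# (current percentile, target percentile) pairs taken from the article's example.
goals = [(65, 75), (40, 50), (15, 25)]

def required_gain(current_pct, target_pct, dist=scores):
    """Score gain needed to move between two percentiles of today's distribution."""
    return dist.inv_cdf(target_pct / 100) - dist.inv_cdf(current_pct / 100)

gains = {pair: required_gain(*pair) for pair in goals}

for (current, target), gain in gains.items():
    print(f"{current}th -> {target}th percentile: +{gain:.1f} points")
```

Under these assumptions, the same ten-point percentile step corresponds to a larger score gain toward the tails of the distribution than near the middle.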

Such goals would perhaps have to vary for subpopulations, ages, regions, and schools. The system would be too complex to be reduced to simple sound bites and administered by the highly politicized federal Department of Education.

The No Child Left Behind Act cannot be “fixed.” It gives us a “sense of urgency for national improvement” at the price of our intellectual integrity, and an unjustified sense of failure and humiliation for educators and students. It’s time to return to the drawing board.