Is High-Stakes Testing Working?

By Jonathan Supovitz

Test-based accountability systems, which use tests to hold individuals or institutions responsible for performance and to reward achievement, have become the cornerstone of U.S. federal education policy, and the past decade has witnessed their widespread adoption. Consider just one material manifestation of this burgeoning trend: test sales have grown from approximately $260 million annually in 1997 to approximately $700 million today, nearly a threefold increase.

What, we should ask, has all that money bought? Research shows that high-stakes assessments can and do motivate change in instructional practice. But critics charge that these changes tend to be superficial adjustments, focused on the content covered and test preparation rather than deep improvements to instructional practice.

What influence has our substantial investment in testing and test-based accountability policy had on the behavior and performance of American educators? Is high-stakes testing a substantive reform, or an intervention that reveals shortcomings in the system but does little to actually improve instructional practice? In short, do high-stakes tests work?

Why We Test

Four major theories underlie our current reliance on high-stakes tests: motivational theory, which argues that test-based accountability can motivate improvement; the theory of alignment, which contends that test-based accountability can spur alignment of major components of the educational system; information theory, holding that such systems provide information that can be used to guide improvement; and symbolism, which maintains that such a system signals important values to stakeholders.

Motivational theory is the predominant theory underlying test-based accountability. According to this concept, the extrinsic rewards and sanctions associated with a high-stakes test motivate teachers to improve their performance. This presumes that educators require external pressure to improve their teaching. For those educators who already have a strong internal sense of responsibility to their profession, the research is inconclusive about the effects of external pressure. Some researchers have found that rewards do not decrease intrinsic motivation (Cameron and Pierce 1994), while others have concluded that tangible rewards often undermine internal motivation (Deci et al. 1999).

The theory of alignment holds that system-wide improvement is most likely to occur if educators align the major components of the educational system (standards, curriculum, and assessments) surrounding schools so that they reinforce each other. Alignment is usually thought of in terms of synchronizing the surrounding system, but can also be thought of as alignment between the external accountability of schools and schools' sense of internal accountability (Abelmann and Elmore 2004).

Information theory maintains that student performance data are useful for teachers and administrators to make decisions about students and programs and that providing such data to local educators and giving them incentives to improve their performance will guide classroom and organizational decision-making.

Symbolism theory has also contributed to the growth and prevalence of high-stakes testing. In this model, the accountability system is seen to signal important values to stakeholders and, in particular, the public. This theory is manifested in the notion of "public answerability": the idea that the public has a right to expect its resources to be used responsibly and that public institutions are accountable for safeguarding the public trust. High-stakes assessments thus serve both as evidence and as a symbol that public education is, in essence, responsible and rigorous.

The Movement Toward Measuring Outcomes

Over the last 20 years, the nation has shifted from tracking educational inputs (e.g., per pupil expenditures, teacher salaries, class size, required courses, seat time) as indicators of educational performance to an increased emphasis on testing as a means to hold schools accountable for educational outcomes. Standardized test results have become the indicators of school and student performance, with public reporting, monetary or nonmonetary rewards, and a range of interventions for low-performing schools as consequences for excellent or poor performance (Elmore et al. 1996; Fuhrman and Elmore 2004).

By the early 1990s, standardized, multiple-choice high-stakes testing came under siege from many constituencies for containing gender bias, ethnic prejudice, and socioeconomic favoritism. Critics bemoaned the narrowing of curriculum and instruction and the perverse incentives inherent in high-stakes testing to retain and reclassify students. Many maintained that multiple-choice testing, with its emphasis on recall of isolated bits of knowledge, represented an outdated behaviorist view of learning. Moreover, research confirmed many of these critiques.

In an effort to address these problems, educators introduced a bevy of alternative forms of assessment (portfolios, performance assessments, and open-ended tasks). Advocates saw them as more valid measures of student performance and as a potential catalyst for school reform. As several states and national organizations began incorporating alternative forms of assessments into their test-based accountability systems, researchers examined the influence of these assessments on both motivation and alignment. Findings showed that teachers organized instruction around the timing of high-stakes assessments (Borko and Elliott 1999), that teachers reported that performance assessments influenced curricular activities and assessment practices (Lane et al. 1999), and that teachers began preparing students for the test rather than the larger learning goals in the curriculum (Stecher and Barron 1999).

Research into the potential for alternative assessments to deliver richer, less biased measures was mixed. Scoring reliability was found to be high in science performance assessments (Baxter et al. 1992) but low in portfolio assessments (Koretz et al. 1994). Portfolio assessments were found to reduce racial/ethnic gaps in performance but exacerbate gender differences (Supovitz and Brennan 1997). Performance task content produced gender-related biases (Jovanovic et al. 1994). Alternative assessments were also found to be cost-prohibitive. For example, large-scale science performance assessments in California were found to be 20 to 60 times more expensive than standardized multiple-choice assessments for an equally reliable score (Stecher and Klein 1997). While the collective research dampened enthusiasm for alternative assessments, some elements were incorporated into high-stakes testing, including open-ended writing and performance tasks.

The NCLB Era

In 2001, test-based accountability was incorporated into the national No Child Left Behind Act (NCLB), a major reform intended to bring about widespread improvements in student performance and reduce inequities between ethnic groups and other traditionally under-served populations. This legislation required states to adopt test-based statewide accountability systems, testing annually in reading, math, and eventually science in grades 3 through 8 and in one year of high school. States were to define proficiency and adequate yearly progress to get all students to proficiency in 12 years. Schools that failed to make adequate yearly progress for two consecutive years would be identified for improvement, and students from those schools would have the right to transfer to another public school. The legislation also required measurable objectives for sub-groups of students and required states to certify teachers as highly qualified.

Studies and analyses of NCLB are beginning to emerge. A four-year analysis conducted by the Center on Education Policy (Rentner et al. 2006) drew on surveys of state policymakers and district administrators and on school case studies. Many of the study’s respondents credited NCLB for rising student performance. At the same time, they reported a narrowing of curriculum, with a focus on reading and math that has reduced instructional time for other subjects, a shift reflecting efforts to align curriculum and instruction with state academic standards and assessments.

Likewise, one synthesis of pre- and post-NCLB literature (Herman 2004) concluded that while accountability certainly attracts teachers’ attention, teachers were more influenced by testing than by the standards themselves. In practice, test preparation merges with instruction, with a concurrent de-emphasis of non-tested content.

Some evidence suggests improvements in national performance associated with test-based accountability. But credit for these gains has also been given to school district policies and programs, as well as to state attention to test-based accountability. Thus, while performance is improving, the contribution of high-stakes testing remains unclear.

Still other research has explored ways to use data from high-stakes tests to improve instruction — and again the verdict on the value of these assessments is mixed. These studies have typically found that data provide general information about student performance but lack the nuance to provide fine-bore instructional guidance. In response to such analyses, many districts have moved to more frequent quarterly or benchmark assessments.

One promising line of research suggests that short- and medium-cycle formative assessments have a role to play in instructional improvement. Long-cycle assessments, which give feedback beyond the instructional unit, are better used for accountability purposes. In this model, state tests are seen as a national map, while district tests provide a compass for improvement, and classroom assessments a GPS unit for localized intervention (Supovitz and Klein 2003).

What Have We Learned?

What has the last decade of testing policy taught us about the likelihood of improving the education system through high-stakes testing? Research shows that high-stakes assessments can and do motivate change in teachers’ instruction. But these changes tend to be superficial adjustments of practice, often focused on modifications in content coverage and test preparation rather than deep improvements to instruction. Let's review the effects according to the four theories of testing policy described earlier:

1) High-stakes testing does motivate educators, but responses are often superficial. In the best cases, high-stakes testing has focused instruction toward important and developmentally appropriate literacy and numeracy skills, but at the cost of a narrower curricular experience for students and a steadier diet of test preparation activities in the classroom.

2) Test-based accountability fosters alignment of the central components of the educational system. The evidence does suggest that high-stakes testing encourages educators to align curriculum, standards, and assessments. Although we do seem to face a chicken-and-egg conundrum when trying to determine whether curriculum and standards are being aligned to the tests or vice versa, alignment is producing a more coherent education system.

3) High-stakes testing regimes have limits as information tools. The data from high-stakes tests are useful to policymakers for assessing school and system-level performance but are insufficient for individual-level accountability and provide meager information for instructional guidance.

4) Test-based accountability is an appealing political strategy. High-stakes testing answers a real need for the education system to demonstrate that it is spending public dollars judiciously.

Where Do We Go From Here?

Over the last couple of decades, test-based reform in the U.S. has gone through two major cycles: first, a widespread exploration into a variety of alternative assessment forms, then an increased emphasis on annual testing and state test performance as the authoritative indicator of the quality of schools and districts. Despite their real differences, these two cycles were each born of the best intentions: a desire to raise the performance of students and to redress inequalities in the educational system. But these disparities are driven by our social priorities, not our educational system, and they will not be remedied by rewriting the tests.

Rather than investing in substantial efforts to improve teaching and learning, we have created a system that values summative testing as the cure to what ails us. These tests certainly have a place in our efforts to improve our schools, but the inflated role they currently play is indicative of our lack of will to enact deeper reform. In short, we have been effective in motivating educators through high-stakes testing — but we have done so without providing clear direction about exactly what we're motivating them to do.

In the next decade, developments in two areas may determine how much progress we make. First, we need to embed information about patterns of student understanding into assessments that can be used by teachers for instructional guidance. Teachers need better information about students' subject matter (mis)conceptions and problem-solving strategies, along with the capacity to use this information to guide their instructional responses, and assessments are an important tool for providing it. Insight into how students are thinking about content is necessary to shape an effective instructional response. Without clues about how students arrived at an answer, teachers are much less able to craft a response that moves student understanding forward.

Second, we need to reform the reform. We must find a way to integrate short-, medium-, and long-cycle assessments into a more coherent system that takes advantage of the strengths of each and reduces the undue influence that a single high-stakes assessment carries. A more robust assessment system might begin in the schools with more formative assessments, continue with a set of curriculum-related interim assessments that act like guideposts, and culminate in a summative annual assessment.

While technological advances make the integration and standardization of such concepts more feasible than ever before, the political challenges are daunting. But if we are serious about improving the education our young people receive, we must relegate high-stakes accountability to its proper place as a measurement and incentive companion to deeper instructional reform.

Jonathan Supovitz, an associate professor at Penn GSE and senior researcher for the Consortium for Policy Research in Education, focuses his research on how education organizations use different forms of evidence to inquire about the quality and effect of their systems to support the improvement of teaching and learning in schools. This article summarizes "Can High Stakes Testing Leverage Educational Improvement? Prospects from the Last Decade of Testing and Accountability Reform," which appeared in The Journal of Educational Change, 10(2-3).

Lane, S., Ventrice, J., Cerillo, T.L., Parke, C.S., & Stone, C.A. (1999). Impact of the Maryland School Performance Assessment Program (MSPAP): Evidence from the principal, teacher and student questionnaires (reading, writing, and science). Paper presented at the Annual Meeting of the National Council on Measurement in Education, Montreal, Canada.