Assessing Assessments — and Assessment Use

Experts from around the world gathered at TC in March to debate whether standardized tests are used in fair and valid ways

By Patricia Lamiell

Cheating on U.S. standardized tests appears to be on the rise, even as the testing industry introduces tighter security measures each year and the penalties for cheaters grow ever more severe.

So, what to do?

To Eva L. Baker, an assessment expert at the UCLA Graduate School of Education and Information Studies, the answer is simple: publish the questions, and the answers, in advance. Goodbye to safeguards that don't work, Baker says, and hello to collaboration and peer learning among students.

Baker was among 28 invited presenters and stakeholders in late March at “Educational Assessment, Accountability and Equity: Conversations on Validity Around the World,” a conference co-sponsored by the Assessment and Evaluation Research Initiative (AERI) at Teachers College and the Educational Testing Service (ETS). Proposed and co-organized by Madhabi Chatterji, Associate Professor of Measurement-Evaluation and Education, and Director of AERI, the conference drew over 250 assessment experts, researchers and educators from around the world, including representatives of the prestigious International Association for the Evaluation of Educational Achievement (IEA), the World Bank, and UNESCO.

Standardized assessments are being used to evaluate and compare students, teachers, principals, schools and entire nations, with far-reaching fiscal and policy consequences. Yet the validity of test scores "frequently and repeatedly breaks down," Chatterji told the assembly, because of a "widening gap" between assessment professionals who work in the rarefied world of test development and psychometrics research, on the one hand, and non-technical end-users such as policy-makers, educators, the public and the media, on the other. "The validity of assessment information depends not only on how tests are designed and formally validated," Chatterji said, but on how the results are "eventually put to use in everyday practice, policy, research, or other, sometimes politically charged, contexts."

For example, to be eligible for federal "Race to the Top" funding, state education departments must use students' test scores as a basis for evaluating teachers for salary increases, promotion and, potentially, firing. And because the federal No Child Left Behind law requires "adequate yearly progress" toward a goal of 100 percent proficiency in math and reading based on student test scores, the fate of an entire school could be on the line as well. Whether or not test scores can be defensibly used for such actions, Chatterji says, depends on a series of technical actions taken during test development, as well as on the evidence that is collected to show what kinds of test uses are valid. Considerations about what tests can and cannot do are rare when there is a societal push for ever more test-based accountability.

With stakes so high, even a test that accurately measures the performance area it was designed to measure may not be fair, said Michael T. Kane, Samuel J. Messick Chair in Test Validity at ETS. For example, a test may fail to take into account socioeconomic factors that can affect performance. Or it may measure something that, while accurate, is beside the point.

Aaron Pallas, TC Professor of Sociology and Education, said he is “concerned” that, at least in the United States, the goal and monitoring of continuous program improvement “could become an end in itself, rather than a means to an end.” Documenting the collection and use of data “could easily overshadow the importance of figuring out the right learning objectives,” Pallas said.

The conference also focused on the issue of how the test scores of U.S. students stack up in global comparative tests such as the Programme for International Student Assessment (PISA), which is periodically administered to 15-year-olds around the world, and the Trends in International Mathematics and Science Study (TIMSS), which is given to fourth- and eighth-graders. In March, a task force led by former U.S. Secretary of State Condoleezza Rice and former New York City Schools Chancellor Joel Klein released a report saying that the mediocre performance of American children on international tests represents nothing less than a national security threat. But Michael Feuer, Dean of George Washington University's Graduate School of Education and Human Development and former analyst at the Congressional Office of Technology Assessment, said that while American performance on international assessments has been disappointing, it "is not exactly as embarrassing or quite as ugly as one would sense from reading the popular press." American educators and the testing industry need to figure out whether the nation's core values are embedded in the international testing programs, Feuer said, in which case the poor U.S. showing is of greater cause for concern, or whether TIMSS and PISA should be used to stimulate a discussion about "what we value in education." Framing the discussion in the latter way "makes a big difference for the ways in which the results are reported and the ways in which they seep into the policy discourse," Feuer said.

Baker said the testing community has come under increasing pressure to codify and measure human capacities and knowledge. The 1999 revision of the "Standards for Educational and Psychological Testing," which she helped lead, spells out the responsibilities of both test-makers and test users to ensure valid and appropriate test use. Baker said test designers have been "swamped" by directives to develop assessments that have advanced the agenda shared by the George W. Bush and Obama administrations to hold teachers and schools accountable for the success of their students, and to help colleges and universities screen applicants by predicting their future progress. She criticized assessments that measure memorization skills instead of more valuable "adaptive problem-solving" abilities.

The transformation of standardized tests and assessments from measurement tools to policy tools has also corrupted the validity of the tests themselves and created widespread mistrust of test results, even among professionals, said Kevin Welner, Professor of Education at the University of Colorado at Boulder.

“If the core meaning of results of tests is to evaluate the teachers’ job, then test preparation becomes the teacher’s job,” said Welner, who has corresponded with President Obama on this topic.

Case in point, according to Leo Casey, Vice President for Academic High Schools at the United Federation of Teachers in New York City: recent New York City test scores were “wildly inflated” and had such wide margins of error (35 points for math and 53 points for English Language Arts) that they were statistically useless. Casey characterized these results as essentially ammunition for the City’s school restructuring plan. “The quality of the data isn’t important,” Casey said. “What is important is that they can be used to close schools and fire teachers.”

Yet despite such unresolved issues, said Nicholas Lemann, Dean of the Columbia School of Journalism and author of a history of the college entrance examination now known as the SAT (originally the Scholastic Aptitude Test), "It is frustrating to see public debate so completely unconcerned with what people in this room would mean by validity." Lemann added that "the press can't tell the difference between good and bad [test] numbers."

Nevertheless, it would be a mistake to stop using standardized tests altogether, said former New York State Education Commissioner David Steiner. Properly constructed, these tests can shine valuable light on areas where students are doing well and areas where they are not, Steiner said. Still, the United States would do well to follow the example of Finland, whose top-performing educational system succeeds because “there is a deeper trust” of both teachers and students. Finnish teachers are given autonomy to plan curricula and to teach, Steiner said; children are given the time and freedom to explore on their own; and the public does not rely on tests as the only validation of the results.

Supported by the National Science Foundation and other sponsors, AERI will continue to promote meaningful use of assessment and evaluation information around the world. Policy briefs will be generated from the conference's proceedings, and the presented papers will be edited by Chatterji and assembled in a special volume and a dedicated journal issue.

“I was impressed with the integrity with which people on both sides dealt with the issue of test use and misuse,” Chatterji said. “There was a lot of soul-searching going on about validity.”