Testing Firm Hits Back Against Claims of Flaws

After a UT-Austin professor's research suggested a flaw in the design of the state's standardized tests, an official with the testing vendor said the firm welcomes an "open dialogue" based on well-founded evidence — but not what he called "wild conclusions."

The state's test vendor has fired back at research from a University of Texas at Austin professor that calls into question the method it uses to develop Texas standardized exams, saying that his claims lack factual evidence.

Denny Way, the senior vice president for measurement services at Pearson, said the company welcomed a “very broad and open dialogue” about the role of standardized testing in evaluating students and schools. But he said that should take place based on well-founded research, not professor Walter Stroup’s “wild conclusions,” which he said would not stand up to the review of outside experts.

The Texas Education Agency also issued a statement defending the standardized exams the state uses to evaluate school districts for accountability purposes. It said that both the STAAR, which ninth-grade students began taking this past school year, and the TAKS, which the state has used since 2003, “are designed and administered in a transparent and highly scrutinized process” that involves hundreds of Texas educators and a committee of national experts in educational research and assessment, and “routinely undergo” reviews for technical quality.

“TAKS and STAAR were soundly designed to measure Texas state content standards,” according to the statement from the agency.

Stroup’s analyses, which he conducted with two other UT-Austin researchers, question how Pearson applies “item response theory,” a widely accepted method of devising standardized exams, to create the state’s TAKS exams. Using that method, he said, test developers select questions based on a model that correlates students’ ability with the probability that they will get a question right.
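
To make the idea concrete, here is a minimal sketch of the kind of model item response theory relies on. It assumes the common textbook two-parameter logistic form; the article does not say which IRT model Pearson uses, and the function name, parameter names and values below are hypothetical and chosen only for illustration.

```python
import math


def prob_correct(ability, difficulty, discrimination=1.0):
    """Illustrative two-parameter logistic item response function.

    Returns the modeled probability that a student with the given
    ability answers an item of the given difficulty correctly.
    This form is an assumption, not the vendor's documented model.
    """
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))


# Hypothetical students of low, average and high ability
# answering one hypothetical item of average difficulty.
for ability in (-1.0, 0.0, 1.0):
    p = prob_correct(ability, difficulty=0.0)
    print(f"ability={ability:+.1f}  P(correct)={p:.2f}")
```

In a model like this, the modeled chance of a correct answer rises with ability, and an item whose difficulty matches a student's ability is answered correctly half the time.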

When used in that way to create large-scale standardized tests, Stroup said, his research shows that IRT produces an exam that is more sensitive to how it ranks students based on that model than to any gains in their year-to-year learning.

Since news reports came out about the research last week, Stroup said that the dissertation on which his claims are based has been downloaded close to 1,000 times by academics across various research fields. He said no one has pointed out a flaw that overturned any of the research’s central conclusions.

The findings — which he presented at a June meeting of the House Public Education Committee and is preparing to submit to research journals — suggest that because of the way the exams are developed, they cannot properly measure the effects of instruction in the classroom.

Way, who also released a statement on the company’s website, said Stroup’s work reflected confusion about how the test development process operates.

“I think you can ask questions from a public policy standpoint,” Way said. “But those are separate from saying that the test is flawed or there's a defect in the machinery, that the item response theory is wrong.”

He said the claim that a previous year’s test scores are a better predictor of students’ results the next year than their performance in the classroom, which Stroup said indicates that the tests cannot reflect what they learned over the school year, did not hold up. Any correlation between students’ scores probably reflected the fact that students are retaining what they’ve learned and building on that knowledge, he said.

The research has generated much discussion among assessment experts, many of whom have underscored the importance of moving the studies through the peer review process so that they can be properly vetted.

Stroup said he initially delayed preparing the findings for academic journals, which can take a year before publishing an article, because of assurances that the STAAR exams would address the issues his research pointed to in the TAKS. When it became clear that the same technique was used to develop the new assessment system, he felt the need to press ahead.

He said he regretted that the lack of publication had distracted from an examination of the work's technical merit, but that he remained confident in its conclusions.

Response to Stroup’s claims has also focused on what Howard Everson, a former test developer and education professor at the City University of New York, said was a common misunderstanding among educators and policymakers about what large-scale, statewide accountability exams are intended to do.

“We design assessments to fit a very specific, clearly defined need. In the case of the accountability test, it is defined broadly to benchmark how schools and school districts in the state are doing from year to year,” said Everson, who has served as the vice president for research at the College Board, which produces the SAT and other admissions tests.

Because statewide accountability exams are created to compare school districts across the board, he said, they aren’t a good measure of how well specific instructional practices or curriculum programs are working within a single district.

“Typically they aren't that sensitive to instruction because the instruction varies from school district to school district,” he said.

The purpose of the exams is to provide a level way to compare districts’ performance, he said, and they shouldn't be used to evaluate individual, district-specific programs. But that doesn’t mean there is a flaw in the way they are put together, as Stroup’s research implies, he said.

"We always caution that you may not be able to make the kind of inferences you want to make from those scores," said Everson, who said that although he’s familiar with Stroup’s claims, he had not yet read the dissertation that sets forth his reasoning.

That warning can get lost as school officials and lawmakers turn to state assessment results as the simplest way to evaluate the quality of educational programs, Everson said. "It's an unfortunate fact that policymakers have to make decisions in real time, and they use whatever evidence is available, and oftentimes the evidence isn't really good for the use that they want to put it to," he said.

Stroup said that the conversation the research had generated, including Pearson’s response, made the need for a public hearing even more pressing. If the tests do not sufficiently measure the quality of instruction or other factors in the classroom, as he said his research shows, he said the vendor should justify their use in the state’s accountability system.

He said legislators should press the test developers to prove why they know that it is prior learning and “not a test-taking construct” that produces the similarity between scores.

“In the end, however it should be the vendor's responsibility,” he said. “And the vendor should have to do this in a public forum, not just somewhere at TEA.”

