What caring educator would not favor tests that allow students a choice in what they must answer?

What responsible college admissions officer wouldn’t grant applicants the right to withhold their SAT scores?

What committed Advanced Placement (AP) teacher wouldn’t expand access to as many students as possible?

What enlightened test developer wouldn’t prefer tests that identify each test-taker’s actual knowledge and skill levels and not those that just deliver a numerical score?

These aren’t just personal attitudes. They sway large organizations as well. The College Board repeatedly talks about “equitable access” to AP courses, and a 2008 report by the National Association for College Admission Counseling solemnly states, “there may be more colleges and universities that could make appropriate admissions decisions without requiring standardized admission tests such as the ACT and SAT,” and further, “some control clearly rests in the hands of postsecondary institutions to account for inequities that are reflected in test scores.”

But what happens when SAT scores are optional in college applications? When Bowdoin College allowed it, two results emerged, both predictable. One, applicants who withheld their scores averaged 120 points lower than those who submitted them. Withholding thus improved their applications, and it also boosted Bowdoin in the all-important U.S. News & World Report rankings (by making the average SAT score of the entering class look higher). But, two, the “withholders” hurt Bowdoin: they performed 0.2 grade points worse than “submitters” did in first-year courses.

Or, what happens when tests allow students to choose the questions they answer, for instance, presenting a pool of essay questions from which test-takers choose two? First of all, you end up with inconsistencies: some questions are harder than others. And second, students often choose poorly, selecting the harder questions. Indeed, one study discovered, “the more that examinees liked a particular topic, the lower they scored on an essay they subsequently wrote on that topic!”

These outcomes belie the policies behind them, and they frustrate the generous souls who crafted the plans. Students end up performing worse than officials expected. Who wants to hear the bad news, though? Not many, and that’s precisely the complaint of distinguished statistician Howard Wainer in his book Uneducated Guesses, in which the preceding quotation and the Bowdoin case appear. Most educators stick to their faiths rather than follow the evidence, Wainer complains, and their stubbornness necessitates this blunt retort to education policies founded on bad evidence and good intentions. The volume’s subtitle, Using Evidence to Uncover Misguided Education Policies, describes the method. In 11 curt chapters, Wainer analyzes actual data and uncovers glitches, quirks, misconceptions, and unintended consequences of one practice after another, particularly those related to tests.

Each practice, from Computerized Adaptive Testing (CAT) to coscaling achievement tests, aims to solve a problem or address a need, but under Wainer’s withering assembly of numbers (scores, dollars, demographics), they collapse. He notes the discomfort people feel with the exclusive nature of AP courses, but wonders if it’s right to open them to students who have little chance of passing the exam. On principle, many would answer, “Give everyone a chance!” But, Wainer replies, such principles aren’t free. He takes the case of AP Calculus results in Detroit and estimates that if the city were to restrict the course to students who score 66 or above on the PSAT Math test, then the resulting cost per passing score on the AP test would be $1,167. If the city set the eligible score much lower, at 31, the cost per passing score would reach $4,513. “Would it be a better use of resources to provide a more suitable course for the students who do not show the necessary aptitude?” Wainer suggests.
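The trade-off Wainer quantifies can be sketched in a few lines. The enrollment figures, per-seat cost, and pass rates below are hypothetical placeholders, not Wainer's Detroit data; only the formula itself (total course cost divided by the number of passing AP scores it yields) follows the text.

```python
# Hypothetical illustration of Wainer's cost-per-passing-score arithmetic.
# All numbers are invented for the sketch; the point is the shape of the
# trade-off, not the Detroit figures quoted in the review.

COST_PER_STUDENT = 500  # hypothetical cost of one AP Calculus seat

def cost_per_passing_score(num_students, pass_rate):
    """Total spending on the course divided by the passing scores it produces."""
    total_cost = num_students * COST_PER_STUDENT
    passing = num_students * pass_rate
    return total_cost / passing  # simplifies to COST_PER_STUDENT / pass_rate

# Restricting the course to high PSAT scorers: fewer seats, high pass rate.
strict = cost_per_passing_score(num_students=100, pass_rate=0.80)

# Opening enrollment broadly: many more seats, but a much lower pass rate.
open_door = cost_per_passing_score(num_students=400, pass_rate=0.15)

print(f"strict threshold: ${strict:,.0f} per passing score")
print(f"open enrollment:  ${open_door:,.0f} per passing score")
```

Note that the number of seats cancels out: cost per passing score depends only on cost per seat and the pass rate, which is why lowering the eligibility cutoff, and with it the pass rate, drives the figure up so sharply.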

In the case of CAT, educators favor the format because it calibrates questions to a student’s ability. If a test-taker misses a question, the next question shifts downward in difficulty. If he aces a question, the next one shifts upward. After a few dozen questions, the test identifies the competency level of the student—a better diagnostic than a simple percentage score. But it doesn’t allow test-takers to review and change an answer, which assessment experts consider important to accurate testing. If CAT does incorporate question review, Wainer warns, then when a subject finds an easy question pop up, he assumes he got the previous one wrong and backtracks to change it. Or worse, he deliberately answers every question wrong, ensuring easy questions all the way through. At the end, he returns to the beginning and answers every question correctly, yielding a near-perfect score. In other words, the very customization that educators praise allows savvy students to game the test. Wainer issued that caution in 1993, and he advises that we keep the original CAT because the benefits of item review don’t outweigh the risks of its abuse. Nevertheless, he notes, test specialists have pressed forward with item review since then—another case of hope overriding evidence.
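The gaming strategy Wainer warned about can be made concrete with a toy model. The difficulty scale, step sizes, and ability cutoff below are assumptions for illustration, not the psychometric machinery of any real CAT; the sketch only shows why answering everything wrong, then revising, defeats the adaptation.

```python
# Toy model (hypothetical, not Wainer's actual analysis) of gaming a
# computerized adaptive test when item review is allowed. Difficulty starts
# mid-scale, drops after a wrong answer, and rises after a correct one.

def run_cat(num_items, answer_strategy, min_diff=1, max_diff=10, start=5):
    """Administer num_items questions, returning each served item's difficulty."""
    difficulty = start
    administered = []
    for _ in range(num_items):
        administered.append(difficulty)
        correct = answer_strategy(difficulty)
        difficulty = min(max_diff, difficulty + 1) if correct else max(min_diff, difficulty - 1)
    return administered

# Honest taker of middling ability: answers correctly when difficulty <= 5,
# so the test oscillates around that level.
honest_items = run_cat(20, lambda d: d <= 5)

# Gaming taker (possible only with item review): deliberately miss every
# question, driving the test down to the easiest items...
gamed_items = run_cat(20, lambda d: False)

# ...then backtrack and correctly answer the easy items just served.
final_correct = sum(1 for d in gamed_items if d <= 5)

print("difficulties served to gamer:", gamed_items)
print("items answered correctly on review:", final_correct, "of", len(gamed_items))
```

In this model the honest taker is held near difficulty 5 while the gamer is fed the floor-level items, all of which fall within even a weak student's reach on the second pass, which is exactly the abuse Wainer's 1993 caution described.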

However sharp and persuasive these exposés, though, they stand at a disadvantage, and Wainer knows it. This idea sounded so right, that innovation so sensible and fair, and watching them fail is depressing. Wainer summons evidence and reality against the modifications, but at stake is not just this or that policy but avid social hopes, sympathy for students, and feelings of injustice, too. One advocate for question review on CAT tests asserts that students “feel at a disadvantage when they cannot review and alter their responses,” their feelings apparently forcing a change in format. Other proponents begin with a basic condition of test-taking, namely stress, leading them to craft methods that allow students more control over the test but that identify cheating (one recommendation they make is to limit changed answers to 15 percent of the total number of answers). Wainer cites both, but has only a dry reply taken from Albert Einstein: “Old theories never die, just the people who believe in them.”


I received an important clarification from Scott Hood in the communications office at Bowdoin:

“As you may know, Bowdoin’s Board of Trustees voted to drop the SAT as a requirement in October 1969, well before the first U. S. News rankings were published in 1983. As the first highly-selective college in the United States to make submission of the SAT optional, Bowdoin made its decision with care nearly 43 years ago. Proponents of the change noted the weak to modest correlation between SAT scores and a student’s performance in college. They felt that better predictors of success could be found through interviews, recommendations, transcripts, and the application itself. And they believed that the SAT was being overemphasized in the admissions process by both secondary schools and parents. There was never a thought that such a policy might one day favorably affect Bowdoin’s position in a yet-to-be-published set of rankings.”

Dear Professor Bauerlein,
Thank you for the kind and accurate review of my book Uneducated Guesses. I received the same note you did from Mr. Hood of Bowdoin. My reply to him contained the reminder:

‘On the bottom of page 10 in Uneducated Guesses I state,

“I don’t believe that this tactic was Bowdoin’s motivation, since it adopted this policy long before colleges were ranked according to SAT scores. But suspicions fall more heavily on recent adopters of optional SAT policies.”

You must also have noted recent disclosures from other schools detailing how they purposely reported overlarge SAT scores to USN&WR specifically to game the ratings. Obviously the subtle methods I discuss were neither efficient nor cynical enough.’