Their study is titled “International Tests Show Achievement Gaps in All Countries, with Big Gains for U.S. Disadvantaged Students.” It includes not only their major analysis of international test scores, but critiques by the leaders of OECD and PISA, and their response to the critiques.

This important study should change the way international tests are reported by the media, if they take the time to read Rothstein and Carnoy.

In every nation, students from the most affluent homes are at the top of the test scores, and students from the poorest homes are at the bottom. In other words, there is an “achievement gap” based on social class in every nation.

They point out that the big assessment programs—PISA and TIMSS—do not consistently disaggregate by social class, which creates “findings” that are misleading and inaccurate.

Rothstein and Carnoy note that American policymakers have been disaggregating by income and other measures since No Child Left Behind was passed, yet they gullibly accept international test score data without insisting on the same kind of disaggregation.

In other words, we know that a school where most of the students live in affluent, college-educated families will get higher test scores than a school in an impoverished neighborhood. But we don’t ask the same questions when we look at international testing data.

Rothstein and Carnoy diligently asked those questions and reached some very interesting conclusions.

*“The share of disadvantaged students in the U.S. sample was larger than their share in any of the other countries we studied. Because test scores in every country are characterized by a social class gradient—students higher in the social class scale have better average achievement than students in the next lower class—U.S. student scores are lower on average simply because of our relatively disadvantaged social class composition.” In other words, we have more poverty than other nations with which we compare ourselves, and thus lower scores on average.

*They discovered that “the achievement gap between disadvantaged and advantaged children is actually smaller in the United States than it is in similar countries. The achievement gap in the United States is larger than it is in the very highest scoring countries, but even then, many of the differences are small.”

*The achievement of “the most disadvantaged U.S. adolescents has been increasing rapidly, while the achievement of similarly disadvantaged adolescents in some countries that are typically held up as examples for the U.S.—Finland for example—has been falling just as rapidly.” (I asked Rothstein whether the gains were attributable to NCLB, and he replied that the gains for the most disadvantaged students were even larger prior to NCLB.)

*The U.S. scores on PISA 2009 that so alarmed Secretary Duncan were caused by a sampling error. “PISA over-sampled low-income U.S. students who attended schools with very high proportions of similarly disadvantaged students, artificially lowering the apparent U.S. score. While 40 percent of the PISA sample was drawn from schools where half or more of students were eligible for free and reduced-price lunch, only 23 percent of students nationwide attend such schools.”

*If the PISA scores are adjusted correctly to reflect the actual proportion of students in poverty, the average scores of U.S. students rise significantly. Instead of 14th in reading, the U.S. is fourth in reading on PISA. Instead of 25th in mathematics, the U.S. is 10th. “While there is still room for improvement, these are quite respectable showings.”
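The adjustment is, at bottom, a reweighting of subgroup averages. As a rough sketch (illustrative numbers only, not the authors’ actual methodology; the two subgroup means below are assumptions for the sake of the arithmetic), restoring the true population weight of high-poverty schools shifts the overall mean upward:

```python
def reweighted_mean(subgroup_means, weights):
    """Overall mean as a weighted average of subgroup means."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(m * w for m, w in zip(subgroup_means, weights))

# Assumed mean reading scores: high-poverty schools (>= 50% FRPL)
# vs. all other schools.
means = [459.0, 527.0]

# Per the study, 40% of PISA's U.S. sample came from high-poverty
# schools, though only 23% of U.S. students attend such schools.
as_sampled = reweighted_mean(means, [0.40, 0.60])   # ~499.8
as_adjusted = reweighted_mean(means, [0.23, 0.77])  # ~511.4

print(round(as_sampled, 1), round(as_adjusted, 1))
```

The direction of the bias is the point: over-sampling the lower-scoring subgroup drags the apparent national average down, and restoring the true population weights raises it, without any individual student’s score changing.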

*Because of PISA’s sampling error, the conclusions expressed by politicians and pundits were “oversimplified, exaggerated, and misleading.”

Rothstein and Carnoy identify important differences and inconsistencies between PISA and TIMSS, and between these assessments and our own NAEP. Taken together, these differences should remind us of the many ways in which the assessments confuse policy and policymakers, the media and the public.

As they note in their conclusion, “it is not possible to say whether the results of any particular international test are generalizable and can support policy conclusions.”

They conclude: “We are most certain of this: To make judgments only on the basis of national average scores, on only one test, at only one point in time, without comparing trends on different tests that purport to measure the same thing, and without disaggregation by social class groups, is the worst possible choice. But, unfortunately this is how most policymakers and analysts approach the field.”


I wrote an analysis of the report for Daily Kos that came out as soon as the embargo was lifted (with Rothstein’s assistance, I had received an embargoed copy to read and analyze). It is an important report, and perhaps some might be interested in my observations, which is why I included the link.

Three years ago, upon the release of some very favorable test score information about the quality of NJ student writing in grade eight, a spokesman for Governor Christie said, in effect: it doesn’t matter, the whole system is wretched.

I suspect the “sunk cost bias” will prevent the data driven reformers from seeing anything positive here. The data driven vultures are actually “driven by the data they like and the research they pay for.”

Well, we continue to play into the deformers’ hands by insisting on using such completely invalid “measurements” as PISA, NAEP, and TIMSS as evidence that “America can and will be the #1 top dog nation of the world.” Until educators stand up and SHOUT the FACT of the complete invalidity of educational standards and standardized testing, as shown by Noel Wilson in “Educational Standards and the Problem of Error,” found at http://epaa.asu.edu/ojs/article/view/577/700, we will be a Manx cat chasing its imaginary tail.

“Student achievement” is the term that is used to further this agenda of dicing and slicing, sorting and separating, standardizing and grading the process (and students) that should be known as teaching and learning and not training and testing.

Diane,

Have you read Wilson’s study? If so, what are your thoughts? And if you understand it, why don’t you come out completely against using these kinds (standards and standardized testing) of nefarious comparisons?

“The achievement of “the most disadvantaged U.S. adolescents has been increasing rapidly….”

Actually, I really don’t consider that good news. Bad news, in fact. It’s just going to be twisted to mean that “no-excuses”, “drill-to-kill” education “reform” is working, so we should do more of it, at least for poor black kids who obviously can’t learn the same way affluent white kids do. Ahem.

I have to agree with Duane – we’ve got to stop using the same measures that the rheephormers do to measure “student achievement”. Intensive test prep drilling does indeed “work” to do what it claims – raise test scores. How in the world that gets conflated with anything actually resembling “achievement” is beyond me.

If, as these authors assert, “it is not possible to say whether the results of any particular international test are generalizable and can support policy conclusions,” then why is it good news that their re-analysis shows low-income American kids do better than others around the world? I thought part of Diane’s worldview was that we should not trust these kinds of tests at all. So is it now that we should be impressed with a reanalysis that shows American low-income kids do well? I think we should be using a variety of methods to assess student progress in a broad array of fields. I also distrust international comparisons – but recall that Diane and others have pointed to Finland’s test scores as evidence of what should be done. Are standardized tests OK when they come to conclusions we like?

I think this information is important for us to know. It helps us have a better understanding of how the system is rigged. The more we know, the better able we are to fight the real fight. This is exactly the kind of information that helps us, as educators, speak to the data that they try to use against us. Thanks Diane, for keeping us informed.

I see two problems:
1. The MSM will headline bad news and bury good news. I don’t think Fox, CNN, or even MSNBC will feature this, in part because it’s too “technical”.
2. The MSM’s insistence on using tests as the sole measure of educational quality forces all writers and bloggers to “defend” their positions based on these godforsaken metrics.

As a person who writes a weekly 500-word column that appears in newspapers reaching up to 650,000 families a week, and having worked with radio, TV, and newspaper journalists for more than 40 years, I’d like to comment on MSM (I think you mean mainstream media – is that correct?).
Reporters and columnists vary widely (as do people in almost any field). Generally the mainstream media won’t carry a story about a re-analysis of someone’s research. This includes original research carried out by Rothstein, or anyone else. Lots more to say on this if anyone is interested. Here’s a link to the columns I write: http://hometownsource.com/tag/joe-nathan/?category=columns-opinion

It has not worked out for the anti-standardized-testing movement to claim that the tests are not valid. Monty Neill (bless his heart) has been saying that for decades with little impact on the public discussion of assessment. It is much more effective, and a kind of jiu-jitsu, to use the data against the people who are promoting testing and more testing. It does not undermine the argument that the tests are invalid to say so. It is like saying, “Using your own instruments, you don’t show what you claim to show.”

Another point is a possible additional bias: in Europe, many students move into vocational tracks between ages 14 and 16. Comparing knowledge at graduation (or really just test results, though as noted above people take those to mean knowledge) is even further distorted.

In Peterson’s words: “What in heck do Carnoy and Rothstein actually mean when they speak relentlessly about social class? Answer: A person’s social class is determined by the number of books 15-year old students estimate are in their home. To repeat: The only adjustment for social class used in this study is the student estimate of the number of books in his or her home.”

Needless to say, this measure is NOT valid to compare the poor vs. rich in different countries. In some countries, people have many more books in their homes than in the U.S. That does not mean those countries are wealthier than the U.S. in any way whatsoever.

Peterson makes this point w/r/t Korea:

“Note that only 14 percent of Korean students come from few-book homes, as compared to 38 percent of U. S. students. Note that 31 percent of Korean students come from homes with many books, while only 18 percent of American students do. To Carnoy and Rothstein these data show that the lower class is nearly three times as large in the United States as in Korea. Even more bizarre, they want us to think the Korean upper class is nearly twice as big as that of the United States. If that is correct, one must expect a major migration from the United States to Korea.”

With the easy Twitter link at the end of this post, I wonder what would happen if we encouraged everyone to tweet this to both Arne Duncan and Barack Obama. Granted, they don’t read their own tweets, but wouldn’t their social media team report utterly amazing numbers on this post? Maybe one or the other of them would read it.

Sandy, what do you think of the Peterson critique? What do you think of the argument that if people don’t believe in the value of standardized test scores, they should not use those test scores to make a point when the scores may support their position? (For what it’s worth, I’ve encouraged schools for more than 40 years to have broader approaches to evaluation than just test scores. And I have been an inner-city public school teacher, administrator, parent, and PTA president.)

Back when I was teaching government in Maryland, when there was a required test in government as a prerequisite for high school graduation, I challenged the state legislature, the state school board, and all local school boards to take the state test in government and publish their scores. Now, the test was bad – questions with no correct answer as phrased, and questions with more than one correct answer. But my challenge was this: if any member of any of those bodies failed the one test that could reasonably be claimed to be directly relevant to their positions (unlike English, Biology, or Algebra), they should do one of two things – immediately resign the position as unqualified, or admit that holding high school students to a standard they themselves could not meet, a standard with at least some facial connection to their responsibilities, was improper and abusive.

Funny, I never got any takers.

So I offered an alternative – take one of my tests on the portion of the course dealing with local and state government and see if they could pass, same deal. Also no takers.

What we do with tests is abusive, because what we are testing is often irrelevant for any purpose except being able to shame or abuse. And that is assuming the tests and the questions were in fact reliable instruments that allowed us to draw valid inferences. At least in the case of the Maryland High School Assessments on Local, State and National Government, they were not.

Diane, what are the consequences of failing a final exam at Columbia University? Are the consequences receiving a low grade or not receiving credit for a course? These have been the consequences of failing at several universities where I have taught. But I don’t know for certain how things work at Columbia. Ken, when you write “what we do with tests is abusive,” who is the “we” in the sentence? Is it teachers? school boards? State Departments? News media? Legislators? Also, are you opposed to students having to pass any tests to pass a class or graduate? I’m asking Diane and Ken because I want to understand your views on the use of tests.

We as a society are abusive in how we are using tests. When we use tests designed to allow us to draw valid inferences about what students know, for that purpose, it does not follow that those tests can be used to draw inferences about what the students have learned or how effective either the teachers or the schools have been. The joint statements of the three major professional organizations involved with testing – APA, AERA, and NCME – have made that clear for years.

When the quality of the tests is poor, and when the response to such criticisms is to not allow the content to be disclosed to cover up the problems with the tests, that is abusive.

When we drop music and art and phys ed and recess and drill so that kids can pass poorly designed tests in math and reading whose demands are often age inappropriate, ignoring what we know about human growth and development, that is abusive.

One needs to demonstrate competence for completion. Tests are often a poor way of demonstrating that competence. Properly designed performance assessments are often far better indicators.

Our use of tests in schools may be imposed in the name of improving learning, but that is not what is happening. Since the imposition of NCLB, the readiness of students going to colleges, universities, even community colleges has gone down, and the amount of remediation necessary has been going up. Yes, I know correlation is not causation, but this is far more than mere correlation in time.

Oh, and Joe? The kinds of exams one takes at better colleges and universities are NOT full of multiple choice items, but instead require performance – essays, or research papers, and the like.

Yes, we took comprehensive examinations showing we knew the field. When I qualified to become a doctoral candidate I had to pass such an examination. In each case, I had a choice among at least two and in some cases three questions to answer. All of the questions required essays and application of information.

When I took my comprehensive exams at Haverford for my double major in music history and music theory, I did not see a single multiple choice question. For theory I was given several hours to do two compositional tasks – one was a four-voice fugal exposition of a theme, and I am somewhat vague on the other – it was 40 years ago. For music history there were 5 periods of music; for each there were 6 or more choices, and we had to choose one for each period. Far different from the kinds of questions my 10th graders had to take in Maryland.

I said there were bad questions. When you require an answer that is technically wrong, such as saying Brown v. Board overturned Plessy v. Ferguson (read the opinion; it did not, because that was not necessary to reach the decision, even though Warren made clear the Court was prepared to do so), you have a bad or at least a sloppy question. Or consider this sample question from when the tests were created:

If Congress were to admit a 51st state, what would be the impact upon representation in Congress?
A. The House and Senate would stay the same size.
B. The House would increase, the Senate would stay the same size.
C. The Senate would increase, the House would stay the same size.
D. The House and Senate would both increase.

The State of Maryland insists that the answer is C. In the long term, that might be correct, but in the short term the correct answer is D, as those of us who remember the admission of Alaska and Hawaii should recall. Of course the admission of a state requires adding two to the U.S. Senate. But constitutionally, no state may be admitted without at least one Representative, and no Representative may be taken from one state and given to another except as the result of reapportionment subsequent to the decennial census.

The size of the House is set by legislation at 435.

When Alaska and Hawaii were admitted, the House temporarily expanded to 437, returning to 435 after the 1962 elections.

If you doubt that, go look at the electoral vote for the 1960 election. Kennedy won 303, Nixon 219 and Harry F. Byrd 15. If you add that up, it is 537, for 100 Senators and 437 Congressmen.
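The arithmetic is easy to verify (a quick illustrative sketch in Python):

```python
# 1960 Electoral College totals: each state's electors equal its House
# seats plus its two Senate seats (DC had no electors until the 23rd
# Amendment took effect for the 1964 election).
kennedy, nixon, byrd = 303, 219, 15

total_electors = kennedy + nixon + byrd              # 537
senate_seats = 100
implied_house_seats = total_electors - senate_seats  # 437, not 435

print(total_electors, implied_house_seats)
```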

I have just scratched the surface with what is wrong with how we – The United States of America – misuse tests and abuse our children, our teachers and our public schools.

Thanks for your response, Ken. Sorry, my work responsibilities do not permit me to respond at such great length. As I understand your viewpoint, in brief, we as a society “misuse tests and abuse children, teachers and public schools.” You are OK with essay tests of the kind some universities and some public schools use as a way to measure what students know. It sounds like you think that students should have to demonstrate certain skills and knowledge in order to graduate. I helped design and implement such a system at a K-12 public school beginning in 1972. There was a math test that was multiple choice. All other measures were applied performance. I wrote about this some years ago in Phi Delta Kappan.

Having worked for 42 years in and around public schools, I think we agree that we should be using multiple forms of measurement. I am excited about the work that is being done by Ann Cook at Julia Richman and other innovative NY City public schools. Are you familiar with that work?

As some readers of this blog know, I have worked with, and continue to promote, strong district and charter public schools. We have worked, for example, with the Cincinnati district public schools and the Cincinnati Federation of Teachers. This culminated in the district eliminating the high school graduation gap between African American and white students.

Joe Nathan points to a dilemma – people, such as myself, who don’t want schools to use tests nearly so much as they do, still want to debunk the attacks on schools that treat poor results on tests as conclusive evidence.

The problem is that when we see something like the results for the 2011 PIRLS and TIMSS that came out recently, or the Rothstein-Carnoy study, we start talking about the tests in the same way that the people who use them to attack public education do. Those discussions make the tests seem a legitimate way to gauge success.

I just finished a book on this (Respect for Teachers or The Rhetoric Gap and How Research on Schools is Laying the Ground for New Business Models in Education — still in Amazon’s Top 3 Million!) and I had this problem the entire time.

Clearly, when high profile people –or, really, just about any group of people — misrepresent the results of US students, it needs to be countered, but too much reliance on test data makes the use of such data seem legitimate.

Nonetheless, that misses the main point: the effect of testing on teaching. Testing is not a neutral measure. It changes the student-teacher and teacher-administration dynamics. I argue, as do many others, that it does so for the worse – that testing, particularly high-stakes testing, is counter-productive, curtailing creativity and deep thinking.

By the way, when I write ‘testing,’ please read ‘standardized testing.’

Additional Note 1:
Paul Peterson’s Education Next response to the Carnoy-Rothstein study is a good example of how the dilemma plays out. Peterson questions the study’s proxy for social class and thus the inferences we should make from the statistics.

According to Peterson, social class “is determined [in the Carnoy-Rothstein study] by the number of books 15-year old students estimate are in their home.” Okay, maybe not the best proxy, but probably not as bad as Peterson indicates. But in debating that, the real question is deliberately avoided and eventually lost.

Poverty is the real question, however, and childhood poverty rates are higher in the US. UNICEF, using a relative childhood poverty measure, puts the US at 23.1%. Only Romania has a higher rate among developed countries (and I didn’t even know it was a developed country). The US rate is about double that of the UK and Japan, nearly three times higher than Germany and France, and nearly four times higher than Norway and Finland.

And relative poverty is a real thing. We group children together by social class, and the lowest 20% on the socio-economic scale have children who generally go to schools in areas of concentrated poverty. And, yes, it really, really matters whether there are books at home and whether these kids have been read to.

This is especially important for tests, because what testing really tells you is what sort of home the kids come from.

SES and level of parents’ education are by far the best predictors of how kids do on tests. The school itself is basically an intervening variable. It has its effects, but they are mainly due to who is grouped together in the school.

When you hear people talking about ‘good schools,’ they are usually talking about a district that is more affluent. Not always, but a disproportionate amount of the time. The other thing they may be talking about is a place where parents support the schools and generally make sure their kids are ready to start school. These factors are highly correlated with income and level of education.

Of course, I’m not saying there aren’t better and worse schools, but the ability of the school to function well is also closely tied to parental support and the readiness of kids to take on new tasks. The school itself, I think, would be the third factor; the peer group your child has and the parents who support your child’s peers are the two most important. And, by and large, more affluent and better educated parents have children who are better students. They also usually have more books in their home.

Also, as for whether people receive course credit after failing a final at Columbia, they probably don’t – having a couple of degrees from there, I would be surprised if they do. But whatever the answer to that question, Columbia is not something on which to model public education. Public education is, or should be, free, universal, and egalitarian; Columbia is expensive beyond all belief, narrowly selective, and elitist. Public education needs to reach out to all Americans and all those residing in the US; Columbia seeks out and selects a group of the highest achieving students in the world, American and otherwise.

Columbia finds the best and brightest and that is where their reputation comes from. My experience told me that they are not better at teaching — my undergraduate school was much better. Columbia professors, not all, but a lot, seemed to think you were lucky to be in their class.

By the way, Paul Peterson’s vision of virtual schools would be a lot like Columbia, selecting out, if not always the best, at least the non-problematic students.

A clear indicator of poverty and its effects is the breakdown by school of the percentage of students eligible for free or reduced-price lunch (FRPL). This is a good substitute for income by school, as it is a proxy measure for the concentration of low-income students within a school. In 2006-07, 20.3 million students (41.2%) were FRPL-eligible, a figure that went up after 2008.

Looking at the reading component of the Programme for International Student Assessment (PISA) 2009, the scale runs from 0 to 1000, but the country averages cluster between 425 and 539. Among developed countries, Korea was first at 539, Mexico last at 425.

In the U.S. the average score was 500, but this varied widely by school. On average, schools with less than 10 percent FRPL-eligible students scored 551; 10 to 24.9 percent, 527; 25 to 49.9 percent, 502; 50 to 74.9 percent, 471; and 75 percent or more, 446. [See Note 1]
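One way to gauge the size of this school-level gradient: the spread between the lowest- and highest-poverty U.S. schools is nearly as wide as the entire range of national averages among developed countries. A quick check using the NCES figures quoted here:

```python
# PISA 2009 reading averages by school FRPL concentration (NCES figures).
us_by_frpl = {"<10%": 551, "10-24.9%": 527, "25-49.9%": 502,
              "50-74.9%": 471, ">=75%": 446}

us_spread = max(us_by_frpl.values()) - min(us_by_frpl.values())  # 105 points
intl_spread = 539 - 425  # Korea (highest) minus Mexico (lowest)

print(us_spread, intl_spread)
```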

So, once sorted in these social categories, what do these scores most likely indicate? That we are coddling the young by giving children a false sense of security or that questions of poverty loom large, but largely unaddressed? That our system of education is failing or that it is highly stratified?

Socio-economic disparities, moreover, are highly correlated with racial disparities: “On the combined reading literacy scale, White (non-Hispanic) and Asian (non-Hispanic) students had higher average scores (525 and 541, respectively) than the overall OECD and U.S. average scores,” comparable with the top five countries, Korea (539), Finland (536), Canada (524), New Zealand (521), and Japan (520); “Black (non-Hispanic) and Hispanic students had lower average scores (441 and 466, respectively),” which compare to the bottom two among the OECD, Chile (449) and Mexico (425).

In the US, it is as if global divisions between North and South are found in microcosm.

=====================================================
[Note 1] See H.L. Fleischman, P.J. Hopstock, M.P. Pelczar and B.E. Shelley, “Highlights From PISA 2009: Performance of U.S. 15-Year-Old Students in Reading, Mathematics, and Science Literacy in an International Context,” U.S. DOE, National Center for Education Statistics, Washington, DC, 2010; http://nces.ed.gov/pubs2011/2011004.pdf