Posts categorized "Education"

Over at the McGraw-Hill blog, I wrote about how to consume Big Data (link), which is the core theme of my new book. In that piece, I highlight two recent instances in which bloggers demonstrated numbersense in vetting other people's data analyses. (Since the McGraw-Hill link is not working as I'm writing this, I placed a copy of the post here in case you need it.)

Below is a detailed dissection of Zoë Harcombe's work.

***

Eating red meat makes us die sooner! Zoë Harcombe didn’t think so.

In March 2013, nutritional epidemiologists from Harvard University circulated new research linking red meat consumption with increased risk of death. All major mass media outlets ran the story, with headlines such as "Risks: More Red Meat, More Mortality." (link) This high-class treatment is typical, given Harvard's brand, the reputation of the research team, and the pending publication in a peer-reviewed journal. Readers are told that the finding came from large studies with hundreds of thousands of subjects, and that the researchers "controlled for" other potential causes of death.

Zoë Harcombe, an author of books on obesity, was one of the readers who did not buy the story. She heard that noise in her head when she reviewed the Harvard study. In a blog post titled "Red meat & Mortality & the Usual Bad Science," (link) Harcombe outlined how she determined the research was junk science.

How did Harcombe do this?

Alarm bells rang in her head because she had seen similar studies in which researchers commit what I call "causation creep." (link)

She then reviewed the two studies used by the Harvard researchers, looking especially for the precise definition of meat consumption, the key explanatory variable. She discovered that the data came from dietary questionnaires administered every four years (which meant subjects who didn't answer this question would have been dropped from the analysis). All subjects were divided into five equal-sized groups (quintiles) based on the amount of red meat consumed. Surprisingly, "unprocessed red meat" included pork, hamburgers, beef wraps, lamb curry and so on. This part was box-checking; it didn't reveal anything too worrisome.

Harcombe suspected that the Harvard study does not prove causation, but she needed more than a hunch. She found plenty of ammunition in Table 1 of the paper. There, she learned that the cohort of people who report eating more red meat also report higher levels of unhealthy behaviors, including more smoking, more drinking, and less exercise.

The researchers argue that their multivariate regression analysis "controlled for" these other known factors. But Harcombe understands that when effects are confounded, it is almost impossible to disentangle them. For instance, if you compare two school districts, one in a very rich neighborhood and the other in a poor one, race and income will be confounded, and there is no way to know whether the difference in educational outcomes is due to income or to race.
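To make the confounding point concrete, here is a minimal simulation, entirely my own sketch with made-up data, in which two nearly collinear predictors stand in for confounded factors; the regression cannot reliably attribute the outcome to either one.

```python
# Sketch: when two predictors are strongly confounded (highly correlated),
# regression coefficients become unstable, so "controlling for" one does not
# cleanly separate their effects. Hypothetical data, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n = 200

for trial in range(3):
    income = rng.normal(size=n)
    race_proxy = income + rng.normal(scale=0.1, size=n)      # nearly collinear with income
    outcome = 1.0 * income + rng.normal(scale=1.0, size=n)   # truly driven by income only

    X = np.column_stack([np.ones(n), income, race_proxy])
    coefs, *_ = np.linalg.lstsq(X, outcome, rcond=None)
    print(f"trial {trial}: income coef = {coefs[1]:+.2f}, proxy coef = {coefs[2]:+.2f}")

# The two coefficients swing wildly from sample to sample even though the
# data-generating process never changes: the model cannot tell them apart.
```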

Next, Harcombe looked for data to help her interpret the researchers' central claim:

Unprocessed and processed red meat intakes were associated with an increased risk of total, CVD, and cancer mortality in men and women in the age-adjusted and fully adjusted models. When treating red meat intake as a continuous variable, the elevated risk of total mortality in the pooled analysis for a 1-serving-per-day increase was 12% for total red meat, 13% for unprocessed red meat, and 20% for processed red meat.

Her first inquiry was about the baseline mortality rate, which was 0.81%. Twenty percent of that is 0.16%, so, roughly speaking, if you decide to take an extra serving of processed red meat every day, you face a less-than-2-in-1,000 chance of earlier death. (Whether the earlier death is due to the red meat or simply to more food consumed each day is another instance of confounding.)
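For readers who want to trace the arithmetic, here is a minimal sketch converting the reported relative risks into absolute terms, using only the figures quoted above:

```python
# Convert the reported relative increases in mortality into absolute terms,
# using the baseline annual mortality rate quoted above (0.81%).
baseline_rate = 0.0081          # baseline mortality rate (0.81%)

for label, relative_increase in [("total red meat", 0.12),
                                 ("unprocessed red meat", 0.13),
                                 ("processed red meat", 0.20)]:
    extra_absolute_risk = baseline_rate * relative_increase
    print(f"{label}: +{relative_increase:.0%} relative = "
          f"+{extra_absolute_risk:.4%} absolute "
          f"(about {extra_absolute_risk * 1000:.1f} in 1,000)")

# For processed red meat: 20% of 0.81% is about 0.16%, i.e. fewer than
# 2 extra deaths per 1,000 people.
```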

This also raises the issue of error bars. As Gary Taubes explained in his response to the red-meat study (link), serious epidemiologists only pay attention to effects of 300% or higher, acknowledging the limitations of the types of data being analyzed. The 12- or 20-percent effect does not give much confidence.

The researchers are overly confident in the statistical models used to analyze the data, Harcombe soon learned. She was able to find the raw data, allowing her to compare them with the statistically adjusted data. Here is one of her calculations.

The five columns represent quintiles of red meat consumption from lowest (Q1) to highest (Q5). The last row ("Multivariate") shows the adjusted death rates with Q1 set to 1.00. The row labelled "Death Rate(Z)" is a simple calculation performed by Harcombe, without adjustment. The key insight is that Harcombe's line is U-shaped while the multivariate line is monotonically increasing.
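To replicate the flavor of this check, one can index crude death rates to Q1 = 1.00 and set them next to the published adjusted ratios. The sketch below uses made-up placeholder counts, not the study's actual figures:

```python
# Sketch of a Harcombe-style check: compute crude death rates by quintile,
# index them to Q1 = 1.00, and compare the shape against the published
# multivariate-adjusted ratios. All numbers below are placeholders.
deaths  = {"Q1": 120, "Q2": 110, "Q3": 105, "Q4": 115, "Q5": 130}     # hypothetical counts
persons = {"Q1": 10000, "Q2": 10000, "Q3": 10000, "Q4": 10000, "Q5": 10000}

crude_rate = {q: deaths[q] / persons[q] for q in deaths}
indexed = {q: crude_rate[q] / crude_rate["Q1"] for q in crude_rate}

for q in ["Q1", "Q2", "Q3", "Q4", "Q5"]:
    print(f"{q}: crude rate = {crude_rate[q]:.4f}, indexed to Q1 = {indexed[q]:.2f}")

# If the indexed crude rates dip and rise (U-shaped) while the adjusted ratios
# rise monotonically, the monotonic pattern is coming from the model's
# adjustments, not from the raw data.
```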

The purpose of this analysis is not to debunk the research. What Harcombe did here is delineate where the data end and where the model assumptions take over. One of the themes in Numbersense is that every analysis combines data with theory. Knowing which is which is half the battle.

At the end of Harcombe's piece, she checked the incentives of the researchers.

Harcombe did really impressive work here, and her blog post is highly instructive on how to analyze other people's data analyses. Chapter 2 of Numbersense looks at the quality of data analyses of the obesity crisis.

For years, I have wanted to see a statistics course that is not a math class. So I made one myself. The title of the course is "How to do statistics without really doing statistics?". It's on a new online learning platform called Three Nights and Done. There are three hours' worth of material, with each hour divided into three or four chunks.

PS. Errata. I have requested that these be fixed; until then, please note:

Night 2 - Part 3: At the start of the discussion of the Monty Hall problem (roughly 6:56), there is a duplicated segment which you will hear again later in the same recording. This can be very confusing to those who don't know what the Monty Hall problem is. Skip from 6:56 to 8:53 (the redundant segment); the flow is maintained from there.

Night 3 - Part 1: the introductory comments to Night 3 went missing, and will be added back. Here is the transcript: Class 3 is based on my new book Numbersense. The premise of the book is that the world of Big Data is very confusing, and filled with claims and counterclaims. A key skill is how to analyze and interpret other people's data analyses. The book gets at different aspects of Numbersense. In Class 3, I focus on two fundamental ones: knowing how data is collected, and knowing how data is processed. These two sound simple; they are anything but. Data collection and processing is a very messy world, and very arbitrary as well. However, these are two really important things to know. If you come across a study that does not disclose details of how data was collected and processed, you should be highly skeptical about the results.

Night 3 - Part 3: Start listening at 12:53. The segment before 12:53 is an exact copy of the end of Part 2.

Traditional Coast Guard boot camp when I was a grunt in the 1960s was twelve weeks, versus the Army's nine-week stint. This was because people who came into the Guard were usually in suboptimal physical condition and because we had more instructional classes, such as semaphore and maritime law.

(From Lee Gutkind's book on creative non-fiction, You Can't Make This Stuff Up)

If some study found that Coast Guard graduates are in worse physical shape than Army graduates, what is the cause?

I was a guest on the Analytically Speaking series, organized by JMP. In this webcast (link, registration required), I talk about the coexistence of data science and statistics, why my blog is called "Junk Charts", what I look for in an analytics team, the tension between visualization and machine algorithms, two modes of statistical modeling, and other things analytical.

The New York Times wrote about how the "Big Data" industry is trying to transform education (link). This is amusing and creepy by turns.

All of these may be well-intentioned, but what strikes me is how unscientific the arguments given in favor of these data-driven methods are. You'd expect the same data-driven approach to be used to justify the new solutions, but you find almost none of that.

***

For example:

Arizona State’s initial results look promising. Of the more than 2,000 students who took the Knewton-based remedial course this past year, 75 percent completed it, up from an average of 64 percent in recent years.

What does "completing" the course mean? Is completion the same as competence? How do we know that the students were comparable from year to year? What is the year-to-year variability of completion rates? Were there any changes in the admission rules or in the criteria for completing the course? Were there any changes in the contents of the course?

Where is the control group? Andrew Gelman has written a number of times about experimentation in education. It would seem that companies like Knewton should take the lead in this type of evidence-gathering.
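As a rough illustration of the variability question, one could at least run a two-proportion comparison of the 75% and 64% figures. The historical cohort size below is my assumption for illustration; the article reports only the percentages:

```python
# Rough two-proportion z-test for the completion-rate claim (75% of ~2,000
# students this year vs. a 64% historical average).
from math import sqrt

n_new, p_new = 2000, 0.75
n_old, p_old = 2000, 0.64        # n_old is assumed; only the 64% average is reported

p_pool = (n_new * p_new + n_old * p_old) / (n_new + n_old)
se = sqrt(p_pool * (1 - p_pool) * (1 / n_new + 1 / n_old))
z = (p_new - p_old) / se
print(f"difference = {p_new - p_old:.0%}, z = {z:.1f}")

# A large z says the difference is unlikely to be pure sampling noise, but it
# says nothing about comparability of cohorts, changes in the course, or
# whether "completion" means competence -- the questions raised above.
```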

***

Elsewhere:

Mr. Lange and his colleagues had found that by the eighth day of class they could predict, with 70 percent accuracy, whether a student would score a “C” or better.

I don't know what the distribution of grades is at this school (Rio Salado), but grade inflation in US colleges has generally moved most, if not all, grades to "C" or better, so I'd consider 70 percent accuracy in predicting "C" or above to be poor. Also, the issue is not whether one can diagnose the cases but whether there is a solution that would improve the underperforming students' grades. That would depend on the reason for underperforming. There will be cases where the students deserve "C" or worse.
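To see why 70 percent may be a low bar, compare it with the no-information baseline of predicting "C or better" for every student. The grade share below is an assumption for illustration:

```python
# Why 70% accuracy can be unimpressive: if most grades are already "C or
# better", a trivial rule that predicts "C or better" for everyone scores
# highly without using any data. The grade share below is hypothetical.
share_c_or_better = 0.80     # assumed share of students earning C or better

baseline_accuracy = max(share_c_or_better, 1 - share_c_or_better)
print(f"always-predict-'C or better' baseline accuracy: {baseline_accuracy:.0%}")

# If 80% of students end up at C or better, the trivial rule scores 80%, so a
# model scoring 70% after eight days of data is not obviously adding value.
```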

Reading the article, I feel that much deeper thinking is needed to figure out why we would want to change in these ways.

***

Change is not always good. I have been teaching a course at NYU for many years. About two or three years ago, the course evaluation form went online. It used to be that I'd dedicate 15 minutes of the last class to handing out evaluation forms, leave the classroom, and designate a student to collect the forms and drop them in the mail. Now, students are reminded by email towards the end of the semester to fill out an online survey.

Not surprisingly, the number of students responding has plunged. It was almost 100% when it was filled in during class. Now it's rarely above 30%. In order to encourage higher response rates, the emails that go out to students (and faculty) have become more frequent and they start earlier and earlier in the semester. The first email that opens the survey window is now sent not long after the midpoint of the course. As a result, students can comment on the class based on half or two-thirds of the experience.

The nature of responses has also changed. I now see mostly extreme opinions. The people who care to write evaluations either love you or hate you. (The irony is that all students think they deserve an A, a standard they don't apply when evaluating professors.) Students who are in the middle don't bother to give feedback.
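Here is a minimal simulation, with made-up response probabilities, of how a voluntary online survey that attracts mainly the extremes can misrepresent the whole class:

```python
# Sketch of non-response bias in voluntary evaluations: suppose students with
# extreme opinions are far more likely to respond than those in the middle.
# All response probabilities are hypothetical.
import numpy as np

rng = np.random.default_rng(1)
class_ratings = rng.integers(1, 6, size=1000)      # everyone's true rating, 1-5 scale

# Hypothetical behavior: the angriest respond most, the happiest next,
# the middle mostly stays silent.
prob_respond = np.select(
    [class_ratings == 1, class_ratings == 5],
    [0.8, 0.6],
    default=0.15,
)
responded = rng.random(1000) < prob_respond

print(f"response rate: {responded.mean():.0%}")
print(f"true class mean rating: {class_ratings.mean():.2f}")
print(f"mean rating among respondents: {class_ratings[responded].mean():.2f}")
print(f"share of 1s and 5s in the class: {np.isin(class_ratings, [1, 5]).mean():.0%}")
print(f"share of 1s and 5s among respondents: "
      f"{np.isin(class_ratings[responded], [1, 5]).mean():.0%}")

# The respondents over-represent the extremes, so both the average and the
# spread of the online evaluations can differ markedly from the whole class.
```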

It is absolutely true that putting the form online is more efficient, saves class time, and creates a data source for future data mining, but the quality of the data has drastically declined.

***

These all go back to the issue of measuring intangible things. It's very difficult to do right. See my related post here.

The following throw-away lines in a Wall Street Journal article about the "return on investment" of getting into college debt (what an idea) are the most important ones:

The report [by the College Board] also doesn't account for dropouts or extra college years. Only 56% of students who enroll in a four-year college earn a bachelor's degree within six years, according to a report last year by the Harvard Graduate School of Education...

PayScale, a Seattle data firm, examines the links between pay and variables like colleges and majors. Its analysis, which also ignores dropouts but accounts for students who take longer to complete their degrees, ...

I cut that off since I've heard enough. How can they get away with ignoring dropouts when they are assessing the return on investment of college debt?

Imagine a cohort of 10,000 students starting college on debt. By year 6, which apparently is when they stop counting, 4,400 have not graduated, either because they dropped out or because they are still in school. Both of these groups are likely to have the lowest return on investment in the cohort. Most of the dropouts won't be getting the higher-paying college-graduate jobs. Those still in school are probably troubled students who, if they do graduate later, would also earn less; even if they are equally qualified, they lose the time value of the extra years.

Given this reality, the analyses by the College Board and by PayScale "ignore dropouts" as if they never existed. In other words, they look only at the 5,600, not the 10,000. This means whatever return on investment they compute will be exaggerated.
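A toy calculation shows how much dropping the non-completers flatters the number. All the earnings figures below are invented for illustration:

```python
# Toy illustration of the survivorship problem: average the earnings premium
# over graduates only vs. over everyone who started with debt.
cohort = 10_000
graduates_within_6yr = 5_600            # 56% completion within six years
non_completers = cohort - graduates_within_6yr

premium_graduate = 20_000               # hypothetical annual earnings premium
premium_non_completer = 2_000           # hypothetical; most never get the graduate premium

survivor_only = premium_graduate        # what you get by "ignoring dropouts"
whole_cohort = (graduates_within_6yr * premium_graduate
                + non_completers * premium_non_completer) / cohort

print(f"survivor-only average premium: ${survivor_only:,.0f}")
print(f"whole-cohort average premium:  ${whole_cohort:,.0f}")

# The survivor-only figure substantially overstates the whole-cohort figure,
# yet everyone in the cohort took on the debt.
```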

***

Technically, this is an example of survivorship bias. The sample being studied does not contain the "non-survivors", in this case the dropouts, so the results don't generalize properly.

Also, the data is censored, in the sense that the observation window is not long enough to tell us what happens to those who stay in college longer than six years. This is a common feature of such data sets; you'd want to do something about it, not just ignore it.

There are in fact many other problems with this type of analysis. Here's another crucial one: the counterfactual for reasoning about whether debt is the cause of higher future wellbeing is not having debt. In other words, any such analysis must tell us what would have happened if the same students had been able to complete college without incurring debt. Based on what the WSJ reporter wrote, I don't think this is how they framed the problem.

The LA Times (link) made the following comment as it described the shameful situation in which the Dean of Admissions at the prestigious Claremont McKenna College (#9 on the US News ranking of Liberal Arts colleges) inflated the average SAT scores of incoming students in order to manipulate national college rankings:

The collective score averages often were hyped by about 10 to 20 points in sections of the SAT tests, Gann said. That is not a large increase, considering that the maximum score for each section is 800 points.

Not a large increase? Are they willfully ignorant, or just ignorant? I hope it's not a quote from CMC President Pamela Gann but an embellishment by the reporter. To interpret whether 10 to 20 points is a "large" increase, one must find the right reference distribution of scores.

The maximum score is 800, but that is for individual scores. The 10-to-20-point manipulation is of the average score of the freshman class (about 300 students). The distribution of individual scores is much, much more variable than the distribution of average scores. So while 10 or 20 points may not be material for an individual, shifting the average score by 10 or 20 points is fraud on a massive scale.

Let's take a rough guess. According to the College Board, the standard deviation of individual scores is about 110 points (see the footnote on "Recentering" on this page). This means the standard deviation of the average score of a sample of 300 is about 6.4 points (this is known as the standard error). A 10-point fraud is about 1.6 standard errors; a 20-point fraud is just over 3 standard errors.

It's easier to visualize the scale of this:

Imagine the college's true SAT score average to be at "Z Score" = 0. Think of that as the median value (50th percentile). A 10-point fraud moves the average to 1.6 on the Z-Score scale, and as the diagram shows, that is moving from the 50th to the 95th percentile! And according to the LA Times, that is the lower bound of the alleged manipulation.
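The arithmetic can be laid out step by step. A minimal sketch using only the figures quoted above (an individual-score SD of about 110 and a class of about 300):

```python
# Reproduce the rough standard-error arithmetic: SD of individual SAT section
# scores ~110 (College Board), freshman class of ~300 students.
from math import sqrt, erf

sd_individual = 110
class_size = 300

standard_error = sd_individual / sqrt(class_size)   # SD of the class average
print(f"standard error of the class average: {standard_error:.1f} points")

def normal_percentile(z):
    """Cumulative probability of a standard normal at z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

for inflation in (10, 20):
    z = inflation / standard_error
    print(f"{inflation}-point inflation = {z:.1f} standard errors, "
          f"i.e. moving from the 50th to roughly the {normal_percentile(z):.0%} percentile")

# 110 / sqrt(300) is about 6.4, so 10 points is roughly 1.6 standard errors
# (around the 94th-95th percentile) and 20 points is just over 3.
```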

Another way to see the size of this manipulation is to look at the average SAT scores for the top colleges (I found some data here but it's from 2004.) For instance, there is only about a 10-point spread between Columbia, Penn, Duke and Rice. Even a few points will shift the SAT score rankings.

***

So, after failing ethics, maybe the College is failing statistics too.

PS. [2/1/2012] @rags and I have been discussing what value of the standard deviation should be used in the standard error formula. The proper value is the standard deviation of the SAT scores of typical freshmen at CMC (or similar schools). The number I found is the standard deviation for the entire SAT-taking population, so it is an over-estimate. If you find a school-level standard deviation, please let me know and I'll adjust the computation. I don't think the conclusion would change much, though, given what we see in the table of average scores by college shown above.

PS. [2/4/2012] If the standard error was over-estimated, then the distribution of average scores would be even tighter than stated. This would make a 10- or 20-point manipulation even more egregious.

I just finished Emanuel Derman's new book, "Models Behaving Badly", which is a good introduction to the philosophy of statistical models. The topic has been swirling in my head after also having read this article by economist Dani Rodrik, who reflected on the recent walkout by some Harvard students of their introductory economics course.

***

In Rodrik's view, the students were right to protest the economics profession because the economic models being taught in the classroom are too simplistic. He paints a particularly eye-opening - and damning - scenario: in the undergrad classroom, as well as in public, the economist admits no doubts about his ideologies (such as "free trade", "free market") but in his "advanced graduate seminar on ... theory", the same professor would debate with skeptics, leading to a "heavily hedged statement" after "a long and tortured exegesis". The statement would begin with "if the long list of conditions I have just described are satisfied, .."

I could imagine Derman entering that graduate seminar and declaring everything nonsense. (Derman currently teaches in the Financial Engineering program at Columbia, and previously worked on Wall Street as a "quant" building economic models, after spending his graduate career working with models of the physical world.) "Models Behaving Badly" is about how economic models can go off-track, how frequently they do, and why modelers must behave modestly. Derman would argue that Rodrik's "long list of conditions" is almost never satisfied.

Derman:

There is a crucial difference between the assumptions made by the Black-Scholes Model and the assumptions made by a souffle recipe. Our knowledge about the behavior of the stock markets is much sparser than our knowledge about how egg whites turn fluffy.

He goes on to argue, perhaps unexpectedly, that the Black-Scholes Model is "the best model in all of economics". He aims his criticism squarely at the sacred cow of financial economics, the "Efficient Market Hypothesis".

***

Rodrik does not believe that the economics profession needs better models. He claims "Macroeconomics and finance did not lack the tools needed to understand how the crisis arose and unfolded." The fault of the profession was to have trusted the wrong models (ones assuming efficient and self-correcting markets). He believes that this bad choice of models is facilitated by "excessive confidence in particular remedies - often those that best accord with their own personal ideologies."

It isn't clear to me how Rodrik proposes to resolve the ideology problem. In fact, his citation of another economist Carlos Diaz-Alejandro perfectly captures the heart of the issue: "by now [1970s] any bright graduate student, by choosing his assumptions... carefully, can produce a consistent model yielding just about any policy recommendation he favored at the start."

The disease is more than ideological. Reading between the lines, I think these models are far too complex for their own good. They cannot be falsified with observed data. They can be made to support any ideology. This leads me to two observations:

The forecast is dire: so long as this type of modeling persists, the choice of models will be based on ideology alone.

Even Rodrik's diagnosis is suspect: perhaps it is only in hindsight that one can determine which model out of that infinite universe of models is "a bad model".

***

Returning to the protesting Harvard students, Rodrik describes the discontent of the undergrad economics syllabus: "it is as if introductory physics courses assumed a world without gravity, because everything becomes so much simpler that way."

In making this analogy, Rodrik is giving economic models the status of models in physics. He's saying that there are simplified models in both disciplines which don't fit reality well, but there are complex models in both disciplines which work well.

Derman would beg to differ. Originally trained as a physicist, he now freely admits that "financial modeling is not the physics of markets". He spends a great portion of the book showing why economic models can never aspire to the status of physics models.

Reading Rodrik's analogy, one senses that he has yet to arrive at Derman's port. Rodrik continues to make parallels between physics and economics. But I know of no introductory physics course that assumes a world without gravity - the major omission is Einstein's relativity. There is, in fact, a huge difference between Newton's theory of mechanics and, say, the Capital Asset Pricing Model. Students who learn Newton's laws can explain how the world works without ever knowing any relativity theory. Newton's theory can stand on its own. Not so the simplistic economics models. As Derman points out, simple economics models are easily invalidated by observed data.

***

My own view, informed by years of building statistical models for businesses, is more sympathetic to Derman than to Rodrik. There is no way that economic (and, by extension, social science) models can ever be similar to physics models. Derman draws the comparison in order to disparage economics models. I prefer to avoid the comparison entirely.

The insurmountable challenge of social science models, which constrains their effectiveness, is that the real drivers of human behavior are not measurable. What causes people to purchase goods, or vote for a particular candidate, or become obese, or trade stocks is some combination of desire, impulse, guilt, greed, gullibility, inattention, curiosity, etc. We can't measure any of those quantities accurately.

What modelers can measure are things like age, income, education, past purchases, objects owned, etc. Nowadays, we can log every keystroke you type on your smartphone (link). That models are even half-accurate is due to the correlation of these measured quantities with the hidden drivers of our behavior, but this correlation is only partial.
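To illustrate the ceiling this imposes, here is a small simulation, entirely hypothetical, in which behavior is driven by an unmeasured trait and the model sees only a partially correlated proxy:

```python
# Sketch: behavior is driven by an unmeasured trait ("desire"); the model only
# sees a demographic proxy partially correlated with it. The proxy's imperfect
# correlation caps the model's explanatory power. Hypothetical setup.
import numpy as np

rng = np.random.default_rng(2)
n = 100_000

desire = rng.normal(size=n)                               # the real, unmeasured driver
proxy = 0.5 * desire + rng.normal(scale=0.87, size=n)     # e.g. income: correlation ~0.5 with desire
behavior = desire + rng.normal(scale=0.5, size=n)         # purchases driven by desire plus noise

r_driver = np.corrcoef(desire, behavior)[0, 1]
r_proxy = np.corrcoef(proxy, behavior)[0, 1]
print(f"R^2 using the true (hidden) driver: {r_driver**2:.2f}")
print(f"R^2 using the measurable proxy:     {r_proxy**2:.2f}")

# A model built on measurable variables is only as good as their correlation
# with the hidden drivers -- "half-accurate", in the language above.
```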

At the end of this post, Felix found it remarkable that the government would not have better access to the data. The same sentiment was expressed at a recent presentation by the data team at Bundle.com, in which they described the unnecessarily strenuous process by which they matched the names of merchants on credit card statements to a database of known merchants. One would think the credit card companies would be able to pass along the merchant identifiers, but they don't or can't.

Simon points out that while the New York Times did a fantastic job with this visualization of the European debt linkages, one should notice what isn't on the chart: the murky world of derivatives. Not knowing that denies us knowledge of the exposure of U.S. banks to this potentially devastating problem.