Posts categorized "Mass media"

The mass media continues to gloss over the imprecision of machines/algorithms.

Here is another example I came across the other day. In conversation, the name Martin Van Buren popped up. I was curious about this eighth President of the United States.

What caught my eye in the following Google search result (right panel) is his height:

Mr. Van Buren was very short, only 5 feet tall. I was about to spin a story about defying the conventional wisdom that the success of men is correlated with height (Wikipedia's entry on human height even has a sub-section on its correlations with occupational success).

Then a noise in my head led me to click on the White House link. And the first thing on Van Buren's biography concerns his height.

Except the White House website tells readers he was 5 feet 6 inches. Those 6 inches are a world of difference!

I'm assuming the White House number is official, and the machines at Google got it wrong (the same mistake appears at Yahoo! and Bing; you wonder what source they are all using, or whether they are just copying each other). Not sure why these search engines ignore what would seem to be the authoritative source.

This calls for machines fact-checking machines. How to make that happen?

***

The error is not trivial! The standard deviation of height in U.S. males is about 3 inches, according to Wikipedia. An error of 6 inches spans two standard deviations, the same width as the range from one standard deviation below the average to one above, which covers about two-thirds of U.S. men!

Wikipedia also advises that human growth hormone treatment is recommended for anyone whose height is 2.25 or more standard deviations below the population average, so by the machines' figure, Mr. Van Buren should be thus treated.
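Here is a quick back-of-the-envelope check in Python. The 3-inch standard deviation is the Wikipedia figure cited above; the 69-inch average is my own assumption for U.S. adult males.

```python
# Back-of-the-envelope check of the two reported heights.
# Assumed: U.S. adult male mean height ~69 inches (my assumption);
# standard deviation ~3 inches (the Wikipedia figure cited above).
MEAN, SD = 69.0, 3.0

google_height = 60.0       # 5 feet, the search engines' number
whitehouse_height = 66.0   # 5 feet 6 inches, the White House number

for label, h in [("Google", google_height), ("White House", whitehouse_height)]:
    z = (h - MEAN) / SD
    print(f"{label}: {h:.0f} in, {z:+.1f} standard deviations from the mean")

# The growth-hormone criterion mentioned above: 2.25 SD below average.
threshold = MEAN - 2.25 * SD
print(f"Treatment threshold: {threshold:.2f} in")
# Google's 60 inches falls below 62.25; the White House's 66 inches does not.
```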

A recent article in USA Today is titled “Many with sudden cardiac arrest had early signs” (link). The signs include shortness of breath, faintness, chest pain, etc. Hold on to the headline because it’s the only thing believable in the entire article.

The words “early signs” imply to readers that, had the men heeded these warnings, they could have prevented the cardiac arrests.

Think about the following two statements:

A1) Many with sudden cardiac arrest previously had symptoms.

B1) Many with symptoms subsequently had sudden cardiac arrest.

These two statements are far from equivalent, even though they describe the same sequence of events.

It’s easier to see the difference if we specify the symptoms:

A2) Many with sudden cardiac arrest had shortness of breath weeks before.

B2) Many with shortness of breath had sudden cardiac arrest weeks later.

It’s even easier to see if we include a number:

A3) 53% of those with sudden cardiac arrest had chest pain, shortness of breath, etc. (a direct quote from the article)

B3) 53% of those with chest pain, shortness of breath, etc. subsequently had sudden cardiac arrest.

B3) is clearly false. The universe of men who suffer from chest pain, shortness of breath, etc. is much larger than the population who have sudden cardiac arrest in any given week. B3) vastly exaggerates the number of sudden cardiac arrests.
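A toy calculation makes the asymmetry concrete. All the numbers below are invented for illustration; only the 53% comes from the article.

```python
# Toy illustration of why P(symptoms | arrest) is not P(arrest | symptoms).
# Invented numbers, except the 53% quoted from the article.
population = 1_000_000            # hypothetical middle-aged men
arrests = 200                     # hypothetical cardiac arrests in a period
p_symptoms_given_arrest = 0.53    # statement A3), from the article

p_symptoms_overall = 0.10         # assumed: 10% of all men have such symptoms
men_with_symptoms = population * p_symptoms_overall            # 100,000
arrests_with_symptoms = arrests * p_symptoms_given_arrest      # 106

p_arrest_given_symptoms = arrests_with_symptoms / men_with_symptoms
print(f"P(arrest | symptoms) = {p_arrest_given_symptoms:.4%}")  # about 0.1%
# Statement A3) holds at 53%, while the B3) reading would be about 0.1%.
```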

***

How did the researchers come to make this type of claim? They were looking at a data set that had no control group.

“Chugh and colleagues studied medical records of 567 men from Portland, Ore., ages 35 to 65, who had out-of-hospital cardiac arrests between 2002 and 2012… 13% had [prior] shortness of breath… ” We have no way of interpreting whether 13% is a big or small number unless we know the proportion of middle-aged men with the same characteristics as those in the study, but without cardiac arrest, who suffered from shortness of breath.

One of the greatest challenges of the Big Data era is the absence of control groups. Without them, we don’t have a yardstick to judge.
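Here is a sketch of the missing yardstick. The 13% among cases is from the study; the control-group rates below are invented to show how the interpretation swings depending on the comparison we don't have.

```python
# The 13% among cases is from the study; the control-group rates are invented
# to show how the interpretation swings with the missing yardstick.
p_cases = 0.13   # prior shortness of breath among cardiac-arrest cases

for p_controls in (0.02, 0.13, 0.25):
    print(f"if controls are at {p_controls:.0%}: "
          f"cases are {p_cases / p_controls:.1f}x the control rate")
# 6.5x looks like a warning sign; 1.0x means the "early sign" signals nothing;
# 0.5x would even point the other way.
```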

In the first chapter of my first book, Numbers Rule Your World (link), I explored the concept of variability using a pair of examples, one of which was Disney's FastPass virtual reservation system. Truly grasping the ins and outs of variability is one of the most important objectives for a budding statistician (or data scientist). In the discussion, I highlighted the work of Len Testa, whose website, TouringPlans.com, provides custom, computer-optimized itineraries for saving time in Disney's theme parks. Testa's team does exemplary work in applying mathematical models to solve a practical problem. I'm glad to present an interview with Testa today.

KF: The problem of hitting a sequence of destinations in the fewest steps has a long history. Lots of people have worked on it. The most famous of these problems, the Traveling Salesman Problem, even makes it into the mainstream press. However, most of this work is highly theoretical. Your touring plans are, to me, a shining example of amazing applied work that makes a lot of people happier. How does your work differ from the others?

LT:

A lot of times when you're trying to analyze data to solve a particular problem, you can approach it either from the perspective of "management" - the people controlling the process - or from the side of the "consumer."

The thing we try to model is optimal movement through a theme park. That is, if you're a customer and you want to ride 10 attractions, in what order should you visit them to minimize your wait in line?

The first time we approached this problem, we tried to figure out all of the things you need to know if you're running a theme park: ride speed, how many vehicles to have on the track, when to schedule entertainment to draw people to other parts of the park, and so on. It was complicated.

Then we looked at it from the point of view of the consumer. Consumers, it turns out, have a lot less information about how a theme park is run. About the only thing they really have is the posted wait time at every attraction. But it turns out that the posted wait time is really a synthesis of all the small decisions a theme park manager makes, so that's all you need. It's also a lot simpler to model.

KF: That is a really great answer. I hope all the budding data analysts out there are listening. Simplicity is beautiful.
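(To make Len's consumer-side idea concrete, here is a minimal sketch of mine in Python. The attractions, posted waits, and brute-force search are invented stand-ins, not TouringPlans.com's data or algorithm.)

```python
from itertools import permutations

# Minimal sketch of consumer-side itinerary optimization. The attractions,
# posted waits, and exhaustive search are illustrative stand-ins only.
posted_wait = {                    # hypothetical posted waits (minutes) by hour
    "Space Mountain":  {9: 20, 10: 45, 11: 70},
    "Splash Mountain": {9: 30, 10: 40, 11: 50},
    "Haunted Mansion": {9: 15, 10: 20, 11: 25},
}
RIDE_AND_WALK = 10                 # assumed minutes to ride and walk on

def total_wait(order, start_hour=9):
    clock, waited = start_hour * 60, 0
    for ride in order:
        hour = min(max(clock // 60, 9), 11)   # clamp to hours we have data for
        wait = posted_wait[ride][hour]
        waited += wait
        clock += wait + RIDE_AND_WALK
    return waited

best = min(permutations(posted_wait), key=total_wait)
print("best order:", " -> ".join(best), "| total wait:", total_wait(best), "min")
```

With three attractions the brute-force search is trivial; with dozens, the problem turns Traveling-Salesman-like and needs clever heuristics, which is presumably where the hard work lies.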

***

KF: What is your pet peeve with published data interpretations?

LT:

Lack of context, especially around economic or political analyses. Yeah, an $860 billion stimulus package sounded like a lot of money in 2008. But in relation to a $15 trillion economy, it's what – 6%? But all of the discourse was on the raw number, not its size relative to the economy.

***

KF: Do you have other tips for doing great applied data work?

LT:

I find reasoning by analogy to be a powerful way to understand and explain things. For example, if you're putting off a flight to Europe because you're afraid of a plane crash, which is a 1-in-500,000 chance, then why would you ever drive to work, where the odds of dying are an order of magnitude worse? So a lot of it is "If I'm willing to accept X then I should be willing to accept Y" type of thing.

Another helpful thing is being able to apply Bayes' Theorem, especially when you're trying to make a business case for something. I remember one time we were trying to get funding to re-do some computer system (at another job), and we calculated a probability of 80% that if we made these changes, we'd succeed in reducing customer problems and lowering future operating costs. Some people looked at that as a 1-in-5 chance of failure. I pointed out that we were making decisions every day with a lot less than a documented 80% chance of success.

Closer to home, I like the mix of statisticians we have now at TouringPlans.com. They have different styles they use to approach problems, so it's useful to hear two different views. And when they agree, you know you've got a decent shot of being right.

KF: Statisticians disagree? They don't know the truth. Readers: you heard it on this blog first! Len, thank you so much for your time.

Doping experts have long known that drug tests catch only a tiny fraction of the athletes who use banned substances because athletes are constantly finding new drugs and techniques to evade detection.

It would be nice if he acknowledged the failure of sports journalists to inform the public of this piece of common knowledge ten or twenty years ago. If the false negative problem is "long known", why didn't the media report on it?

Even when I was researching my first book five years ago, almost every story was about how athletes were wrongfully accused of doping, how only insufficiently talented athletes would need to dope, how a long string of negative tests threw doubt on the one positive result, how testers bungled the collection of samples and thereby hurt an athlete's reputation, etc.

We also were led to believe that athletes were ingesting banned substances inadvertently because they took herbal supplements, that young, physically hyper-fit athletes had medical needs to patronize anti-aging clinics, that certain "good guys" were just curious, and somehow got ensnared on the very first and only time they tried steroids, that other "good guys" only took steroids when trying to recover from injuries, etc.

Some readers may recall the almost cultish defense of Floyd Landis after the American cyclist won the Tour de France and then was stripped of the title. He made up some lame excuse which, through the magic of sports journalism, developed into a kind of mass delusion, the evidence of which lives on at the Trust But Verify blog, long after Landis confessed and took down the cycling profession as he sank. (See here for my previous coverage of the Landis circus.)

The cause-effect link in Rohan's quotation above is misleading. I won't repeat all of the arguments here--they are carefully laid out in Chapter 4 of Numbers Rule Your World (link)--but false negatives arise for a multitude of reasons. One reason is the emergence of new drugs and techniques to evade detection. But the most important reason is that the tests themselves are tuned to have a high positive predictive value--meaning a positive finding is highly trustworthy. Because of the inevitable tradeoff between false positives and false negatives in any diagnostic system, making positive findings more trustworthy has the side effect of making negative findings less trustworthy.
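The tradeoff can be shown with a few invented numbers: tune the test so that positives are nearly always right, and most dopers will slip through as negatives.

```python
# Invented numbers illustrating the false positive / false negative tradeoff.
dopers, clean = 100, 900          # hypothetical tested population

def test(sensitivity, false_positive_rate):
    tp = dopers * sensitivity                 # dopers caught
    fp = clean * false_positive_rate          # clean athletes wrongly flagged
    fn = dopers * (1 - sensitivity)           # dopers who pass (false negatives)
    ppv = tp / (tp + fp)                      # how trustworthy a positive is
    return ppv, fn

ppv, fn = test(sensitivity=0.20, false_positive_rate=0.001)
print(f"strict test: PPV = {ppv:.0%}, false negatives = {fn:.0f} of {dopers}")

ppv, fn = test(sensitivity=0.90, false_positive_rate=0.20)
print(f"loose test:  PPV = {ppv:.0%}, false negatives = {fn:.0f} of {dopers}")
# Strict: a positive is ~96% trustworthy, but 80 of 100 dopers pass.
# Loose: only 10 dopers pass, but a positive is right just a third of the time.
```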

Such reporting perpetuates the myth of predictive models. Whenever the media reports on statistical methods used to predict things, it never ever tells us about the accuracy of such models. These reporters act as if predictions are 100% accurate, or close to it. But statistics is a science of uncertainty. Our society is overconfident about any kind of predictive models, not only for detecting dopers but also for detecting terrorists.

See Chapter 2 of Numbers Rule Your World, and Chapter 5 of Numbersense for more coverage of predictive models.

***

The past year has finally seen a public recognition of the "false negative" problem in anti-doping testing, which I called attention to years ago. The downfall of Lance Armstrong and Alex Rodriguez, among others, confirms that elite athletes dope, and that elite athletes can get tested hundreds of times and pass every test.

It would be appropriate if the sports journalist guild would issue a mea culpa: "Sorry readers, we have been duped."

For those who still haven't noticed, a drug test did not catch notorious superstar dopers Armstrong or Rodriguez.

Note to readers: There will be few updates to the blog in the next few days until southern Manhattan gets its power back.

Note to AT&T: If you listen to your customers, you won't need to "investigate". Cell phone service has been unavailable since Monday.

***

Lots of numbers are thrown around in the media all the time. Do you know how they are derived? This is a really important question to ask every time you see a number.

Sometimes, the source is pretty obvious. Polls and surveys provide a lot of data. Websites report a lot of numbers because every click is registered in a database. Some data are painstakingly collected by human beings, like the situational statistics recorded during a baseball or football game.

What you may not realize is how unreliable some of these data sources can be. A recent New York Times article discusses a revision to the estimate of how much sugar Americans are consuming (link). It reveals just how inexact this science is.

Some of the steps involved in deriving that number:

An average American diet is estimated

For each food item, the sugar content is estimated

For each food item, the "food loss" is estimated. The idea is that if food is bought but not consumed, the sugar is not consumed.

A Nielsen survey of food purchases is used

Interviews conducted by CDC are used

Businesses and lobbyists for the sugar industry are consulted to provide "expertise"

Academics are consulted to provide "expertise"; however, none of these professors is willing to talk about what "advice" they provided. They claimed they could not recall: apparently no one took notes at these meetings, and no one could remember something that happened in 2008.

This enterprise sounds impressive because it is so complex. But complexity is only a good thing if we are able to measure the finer components more precisely; in this case, even at the finer level, each step involves guesswork. The error of each step then compounds. When you add in lobbyists who have an agenda, it's a grand mess.
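A quick Monte Carlo sketch of the compounding, with invented uncertainty levels: suppose the final number multiplies three components (diet quantity, sugar content, food loss), each known only to within plus or minus 20 percent.

```python
import random

# Monte Carlo sketch of compounding estimation error (all figures invented).
# Final estimate = diet quantity x sugar content x food-loss adjustment,
# each component off by up to +/-20% of its true value.
random.seed(1)

def noisy(true_value=1.0, rel_error=0.20):
    return true_value * random.uniform(1 - rel_error, 1 + rel_error)

trials = sorted(noisy() * noisy() * noisy() for _ in range(100_000))
lo, hi = trials[5_000], trials[95_000]        # middle 90% of outcomes
print(f"90% of combined estimates land in [{lo:.2f}, {hi:.2f}] x the truth")
# Three +/-20% components compound to roughly +/-30% overall, before any
# systematic bias from interested parties is layered on top.
```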

The news here is that the Agriculture Department decided that its past "food loss" estimates were flawed; using new "methods", its estimate of sugar consumption fell by a whopping 25 percent. How can we believe this new research? To me, it is impossible to create "food loss" estimates for hundreds of food items. If you look at the report, they have food loss estimates for each of spinach, squash, sweet potatoes, snap beans, tomatoes, broccoli, etc., and the tables go on for pages.

The researchers are not conservative either. They make huge revisions to the prior estimates. The article mentioned pumpkins, which went from 20 percent lost to 69 percent lost. Yes, they claim that nearly 70 percent of all pumpkins sold are never eaten. I mean, it's Halloween, but are we really growing pumpkins just for the fun of one night?

Now these researchers told the reporter that comparing the old estimates to the new estimates is "improper because of changes in methodology". This is absurd; new methods must be compared to existing methods.

We sorely need an estimate of the margin of error of this sugar consumption number. Given that it has so many components, each of which has a wide margin of error, my suspicion is that the entire enterprise collapses under its own weight.

***

What are some ways to create a much more credible estimate? Look at the diet, and focus on the big items. Recruit a panel of people (as Nielsen does) and ask them to record their own behavior for a period of time: how much they purchased, how much they threw away. At least then we have a set of real data to work with. The sample size may be smaller but the quality of the data is much higher.

I'm sure there are many other superior methods. It's an interesting problem for statisticians to think about. For others, be careful when you're fed numbers.

I found my way to Mark Liberman's post at Language Log by way of a comment by Kyle on Andrew Gelman's post about Dubner's response to our Freakonomics article. I've always enjoyed Mark's posts and this one is no exception. His first bullet point speaks to one of my chief worries about Freakonomics-style analyses.

For background, Mark raised some doubts about recent academic work that supposedly shows that the left-right asymmetry in the QWERTY keyboard design affects our perception of words. The researchers concluded: "Words with more right-side letters were rated to be more positive, on average, than words with more left-side letters. We call this relationship the QWERTY effect."

***

Mark did some quick analyses which failed to replicate the finding. But his first point has nothing to do with replication. It is valid even if the original research has been done impeccably. Here are the words you must read:

1. The QWERTY effect's size. As far as I'm concerned, and as far as the general public is concerned, the size (and therefore the practical importance) of the QWERTY effect (if it exists) is the key question. This is not an entirely subjective matter — we can ask, as I did, what proportion of the variance in human judgments of the emotional valence of words is explained by the "right side advantage". The answer is "very little", or more precisely, around a tenth of a percent at best (at least in the modeling that I've done).

I focused on the effect-size question because the press release said the following (and the popular press took the hint):

Should parents stick to the positive side of their keyboards when picking baby names – Molly instead of Sara? Jimmy instead of Fred? According to the authors, “People responsible for naming new products, brands, and companies might do well to consider the potential advantages of consulting their keyboards and choosing the 'right' name."

So C&J may not be interested in my subjective evaluation of the effect size, but they promoted their own subjective evaluation by suggesting that the effect is important enough to matter to people choosing names. I felt (and feel) that this represents a serious exaggeration of the strength of the effect; and it seemed (and seems) appropriate to me to say so publicly.

***

Mark's complaint is similar to my response to several results championed by the Freakonomics team, including the "surname effect" as it relates to winners of the Nobel Economics Prize, and the "birthday effect" as it relates to sports leagues.

The common ingredients of such analyses are: published, peer-reviewed scholarly work that identifies an interesting effect meeting the standard of statistical significance, followed by the media's amplification and popularization of results that (a) ignore practical significance; and (b) apply a causal interpretation, possibly unknowingly.

(a) Practical significance

Statistical significance is designed to measure one thing only: how likely would we observe the effect being investigated assuming the effect does not exist (i.e., what's the chance of a false positive)? We need this concept because many observed effects (especially small effects) can happen by chance and therefore should not be attributed to the factor being studied.

Statistical significance is necessary but not sufficient for practical value. In other words, a practically meaningful effect must be statistically significant but there are many statistically significant effects that have little to no practical value.

Statistical significance will get one published in a peer-reviewed journal but it's not the job of a journal editor to discern practical value. Aside from statistical significance, the editor's other standard is contribution to the scholarship (i.e. novelty), which tells us nothing about practical value either. Thus, the "peer review" standard cannot defuse this issue.

As Mark pointed out, the practical importance of an effect is given by its effect size. For the QWERTY effect, it is one tenth of one percent at best. If you list out all of the factors that may affect "human judgments of the emotional valence of words", you will have a long list. If you now rank the factors in terms of their effect sizes, where would the QWERTY effect fall? Mark is saying the effect is one tenth of a percent at best and he's implying there are other factors more important than QWERTY at play here. Why focus everyone's attention on QWERTY when other factors may be even more important?
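The gap between statistical and practical significance is easy to simulate. In the sketch below (all numbers invented), the true group difference is a fiftieth of a standard deviation, yet a large sample declares it highly significant while it explains a negligible share of the variance.

```python
import math, random, statistics

# Invented simulation: a trivially small effect passes the significance bar
# in a big sample while explaining almost none of the variance.
random.seed(7)
n = 200_000
left  = [random.gauss(0.00, 1.0) for _ in range(n)]   # "left-side" words
right = [random.gauss(0.02, 1.0) for _ in range(n)]   # true effect: 0.02 SD

diff = statistics.mean(right) - statistics.mean(left)
se = math.sqrt(statistics.variance(left) / n + statistics.variance(right) / n)
print(f"estimated effect: {diff:.3f} SD, z = {diff / se:.1f}")  # well past 1.96

r_squared = (0.02 / 2) ** 2   # share of variance explained by group membership
print(f"variance explained: {r_squared:.4%}")                   # ~0.01%
```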

This is clearer in the Freakonomics examples. In the "birthday effect" research, it is acknowledged that genetic factors like gender and whether one's dad is a professional athlete are much more powerful than one's birthday. While the birthday effect is statistically significant, it is relatively small so why focus everyone's attention on that rather than on more powerful factors? As for Nobel economics prizes, one can name a variety of factors, such as educational background, creativity, intelligence, and influence that are more powerful than the first letter of one's surname.

***

Are Mark and I fussing over little details? No.

Go back to the 5% statistical significance level. By accepting this convention, we accept that 1 in 20 nonexistent effects will nevertheless show up as significant. And no one knows which result is the false positive. If you have to guess, you should be more suspicious of results where the effect size is small, or where significance is barely achieved. Because of the 5% convention, you shouldn't be surprised that lots of published results just clear the 5% mark. In addition, Andrew Gelman (link) tells us why the conventional acceptance criterion ensures that if the true effect is small, the estimated effect size will be exaggerated (and thus inaccurate).
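Gelman's exaggeration point is worth a small simulation of its own (the setup is invented): when the true effect is half its standard error, the lucky estimates that clear the 5% bar average roughly four times the truth.

```python
import random

# Invented setup for Gelman's point: small true effect, noisy studies,
# and only the statistically significant estimates get trumpeted.
random.seed(42)
TRUE_EFFECT, SE = 0.5, 1.0          # true effect is half the standard error
BAR = 1.96 * SE                     # the 5% two-sided significance bar

estimates = (random.gauss(TRUE_EFFECT, SE) for _ in range(100_000))
significant = [est for est in estimates if abs(est) > BAR]

share = len(significant) / 100_000
avg = sum(significant) / len(significant)
print(f"{share:.0%} of studies reach significance")
print(f"their average estimate: {avg:.2f} vs true effect {TRUE_EFFECT}")
# The published (significant) estimates average ~2.0, four times the truth,
# and a few of them even carry the wrong sign.
```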

For all these reasons, statisticians are reluctant to tout small effects that barely achieve 5% statistical significance, especially in observational data.

In some of these examples, there is no way to arrange randomized experiments to verify the reported effects. We can't run experiments on birthdays and sports success, for example. But when such experiments are possible, the results have often been ugly. To see what I'm talking about, I highly recommend reading this blog post by Gary Taubes, a long-time science reporter (thanks to John for his comment on my post on the red meat finding).

Taubes's article is really long. Here are the two key quotes:

this meat-eating association with disease is a tiny association. Tiny. It’s not the 20-fold increased risk of lung cancer that pack-a-day smokers have compared to non-smokers. It’s a 0.2-fold increased risk — 1/100th the size...very few epidemiologists would ever take seriously an association smaller than a 3- or 4-fold increase in risk. These Harvard people [i.e., the researchers behind the red meat study] are discussing, and getting an extraordinary amount of media attention, over a 0.2-fold increased risk.

... every time in the past that these researchers had claimed that an association observed in their observational trials was a causal relationship, and that causal relationship had then been tested in experiment, the experiment had failed to confirm the causal interpretation — i.e., the folks from Harvard got it wrong. Not most times, but every time. No exception. Their batting average circa 2007, at least, was .000.

All those peer-reviewed, statistically-significant results amounted to a huge pile of misinformation... but a lot of press, which by the way does not have the habit of retracting erroneous reporting of this type.

***

To summarize: a lot of published effects are tiny, which means they have no practical value even if they have entertainment value; moreover, when reported effects are tiny, there is a good chance that they are false positives so trumpeting them for titillation can set you up for later embarrassment.

I'll address the other aspect of this practice, that of unwarranted causal interpretations, in a later post.

It's very frustrating to read journalism about health and medicine. Reporters repeat bad advice from papers that are published with narrowly-defined research objectives.

Here's a recent example about red meat (link; the study itself). The headline gives us a prescription: "Study gives more reasons for passing on red meat". The key paragraph tells us:

Each extra serving was also tied to a 16 percent higher chance of dying from cardiovascular disease in particular, and a 10 percent chance of dying from cancer -- even after taking into account other aspects of health and lifestyle that could influence the chance of dying, such as weight, smoking, the rest of their diet and socioeconomic factors.

Complaint #1: every food has harms as well as benefits

Every food has harms and benefits. These studies are almost always looking for "causes" of disease. Rarely is a study reported to prove that an extra ounce of red meat will enhance our intelligence by 10% (I'm making this up). Red meat surely has some benefits. So we must balance both the benefits and the harms in order to decide how much to eat.

Complaint #2: there are multiple causes for cancer/heart disease/etc.

There is an unspoken implication that if one eats less of food X, one will not get disease Y. This is wrong on several levels. First, a 10 or 12 or 16 percent reduction in mortality rate may sound impressive, but what is the base rate? The average age of the 120,000 people being analyzed was about 45-50. In following them for 22-28 years, about 24,000 died of any cause; that's 20 percent. A 10 percent improvement on this probability is 2 percentage points. Is this practically meaningful? Only you can tell.
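Here is the arithmetic spelled out, using the rough figures quoted from the study:

```python
# Relative vs. absolute risk, using the rough figures quoted above.
followed, deaths = 120_000, 24_000
base_rate = deaths / followed                 # deaths from any cause
print(f"base rate over 22-28 years of follow-up: {base_rate:.0%}")

relative_increase = 0.10                      # "10 percent higher chance of dying"
absolute_change = base_rate * relative_increase
print(f"absolute change: {absolute_change:.0%} "
      f"(about 2 extra deaths per 100 people)")
```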

Another problem is that cancer, heart disease, etc. have multiple causes. Most people are exposed to multiple risks that could lead to these diseases. Reducing one type of risk most likely will not stop you from getting disease Y.

Complaint #3: you can only control for things you know about

One typical defense of such analysis is the use of control variables. In the above quote, we read something like "after taking into account other aspects of health and lifestyle that could influence the chance of dying, such as weight, smoking, the rest of their diet and socioeconomic factors."

Impressive, right? The problem is that for most diseases like cancer, we don't yet, and may never, know the entire causal structure of how one gets it. We can't control for variables that we don't know are related to the disease, or ones we mistakenly think are unrelated, or ones that are not being measured.

Secondly, these regressions almost never take into account the fact that the control variables are not independent of one another; for example, poorer people are more likely to have less healthy diets, and someone who eats more meat is likely to eat fewer vegetables. Say someone reduces meat intake. This person would likely eat more vegetables and/or fruits as a result. The single coefficient for meat in the regression would then underestimate the impact of actually cutting meat.
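A small simulation (with invented coefficients) of this point: the regression's coefficient on meat is the effect with vegetables held fixed, but a real person who cuts meat also eats more vegetables, so the realized effect differs from what the single coefficient suggests.

```python
import numpy as np

# Invented coefficients illustrating partial vs. realized effects when
# control variables move together.
rng = np.random.default_rng(0)
n = 50_000
meat = rng.normal(size=n)
veg = -0.5 * meat + rng.normal(size=n)    # meat eaters eat fewer vegetables
B_MEAT, B_VEG = 1.0, -0.8                 # assumed harm of meat, benefit of veg
y = B_MEAT * meat + B_VEG * veg + rng.normal(size=n)

X = np.column_stack([np.ones(n), meat, veg])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(f"regression coefficient on meat (vegetables held fixed): {coef[1]:.2f}")

# A person cutting meat by 1 unit also eats 0.5 units more vegetables:
realized = B_MEAT * (-1) + B_VEG * (+0.5)
print(f"realized change from actually cutting meat: {realized:.2f}")
# -1.4 vs. the -1.0 implied by reading the meat coefficient alone.
```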

***

I cannot end without pointing out that it is misleading for the reporter to say "Each extra serving was also tied to a 16 percent higher chance of dying from cardiovascular disease". If this were so, with 5 extra servings, we'd all be dead. It should really say "have one extra serving per day, every day you're alive", which is what the researchers were measuring.

Every election year, journalists feast on the hundreds of poll results that never seem to end. They frequently abuse language as they try to explain what the polls mean. Because polls are small samples of people, poll results can only say so much. Specifically, when races are tight, they don't tell us much. This lack of clarity creates a certain nervousness among the prognosticators.

It was refreshing to see this headline: "Among Republicans, Santorum in statistical dead heat with Romney". (link) When was the last time the news told us two candidates were neck and neck? The actual result was 30% Santorum to 28% Romney. You'd expect a headline like "Santorum slightly in front of Romney", even though the difference between them is statistically meaningless.

Some other media called this a "virtual tie". I hate this term. A "virtual tie" is a tie that really isn't; alternatively, a "virtual tie" is nearly a tie. Neither sense of the word applies here. It is a tie, nothing "virtual" about it. In fact, when the Washington Post prints "Polls show Rick Santorum virtually tied with Mitt Romney nationally" (link), it gives the impression that Mitt Romney is slightly ahead... that story certainly did not come from the Pew poll. By the way, the Post manages to print this article and mention two recent polls without actually citing any numbers!

***

I just have to re-print the following table from Pew's press release here. Reporters frequently ignore the margin of error (again, this Yahoo reporter did well to feature the margin of error in the 2nd paragraph). Because these are polls with very few respondents, the margin of error is plus/minus 5 percentage points or more (for any subgroup being analyzed).

The 30% to 28% comparison was made among respondents who are "Republican or leaning Republican registered voters". The margin of error for this subgroup is given by the fourth line up from the bottom of the table: plus/minus 5 percentage points. This means the poll can only tease apart differences larger than 10 percentage points. Notice that for any result concerning Republican voters (excluding those who said they are leaning Republican), the margin of error is even larger, at plus/minus 6 percentage points.
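For reference, the margin of error for a poll proportion is roughly 1.96 times sqrt(p(1-p)/n). The subgroup size below is my assumption, back-solved to match the reported plus/minus 5 points:

```python
import math

# Margin of error for a poll proportion: ~1.96 * sqrt(p * (1 - p) / n).
def margin_of_error(p, n):
    return 1.96 * math.sqrt(p * (1 - p) / n)

n = 385   # assumed subgroup size; yields the conventional +/-5 points at p = 50%
for p in (0.50, 0.30, 0.28):
    print(f"p = {p:.0%}: +/- {margin_of_error(p, n):.1%}")
# Santorum's 30% vs Romney's 28%: a 2-point gap inside a +/-5-point margin.
# Resolving a 2-point gap would take a sample roughly 25 times larger.
```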

Yes, if you think these polls are useless for measuring tight races, you'd be right.

Thanks to @jeanniecrowley for pointing me to Rocky Agrawal's wonderful piece on how Google - and by extension, many hyped-up tech companies - abuses statistics to deceive the public. (link here).

The post is worth reading in full. The highlight for me is this bit:

In July, [Google CEO] Page claimed that the service had 10 million users who shared 1 billion items a day. That sounds incredibly impressive. But let’s do the math. That would mean that the average user was sharing 100 items a day...

So how did we get to that number? Well, it turns out Google was counting every potential recipient of that message. A single message from Scoble today would count 240,000 times toward that number. That’s preposterous.

There are many cases like this one where one can easily spot mischief just by taking a skeptical stance. It's a simple division to get to an average of 100 items per user per day. If one knows a little statistical thinking (say, the material in Chapter 3 of Numbers Rule Your World), one quickly realizes that not all 10 million users can be like the "average user"... in fact, with so many users, there would be lots of inactive users, who share zero items per day. For every non-sharer, there must be an active user who shares 200 items per day to keep the average at 100.

What Agrawal is arguing is that even the maximum sharer ("Scoble") probably didn't share 100 items a day, and the maximum can't be smaller than the average.
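A toy sketch of both problems, with invented numbers: counting every potential recipient inflates the tally, and a skewed user base makes the "average user" a fiction.

```python
# Toy sketch (all numbers invented) of the two problems above.

# 1) Counting every potential recipient inflates the share count:
shares = [("Scoble", 1, 240_000),        # (user, items shared, recipients each)
          ("typical user", 2, 50)]
actually_shared = sum(items for _, items, _ in shares)
counted = sum(items * recipients for _, items, recipients in shares)
print(f"items actually shared: {actually_shared}; items counted: {counted:,}")

# 2) An average of 100 items/user/day over 10 million users is implausible
# once you account for inactive users:
users, claimed_avg = 10_000_000, 100
active = 50_000                          # assumed: 0.5% of users do all sharing
per_active = claimed_avg * users // active
print(f"if only {active:,} users share at all, each must average "
      f"{per_active:,} items/day, beyond even the most active user")
```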

***

I like to call these "true lies". Under certain assumptions, exclusions and definitions, one can certainly justify that these statistics are "true" but the effect on consumers of such statistics is to mislead. Frequently, one needs to examine a set of statistics side-by-side to fully understand the data.

Most importantly, one should start with why we are computing the statistic. Take the Facebook "Like" button, which is touted in a lot of places as a measure of marketing success. I just went over to the McDonald's U.S. Facebook page. It shows about 14 million Likes. What does that statistic mean? Since Facebook reports about 150 million U.S. users, does this mean less than 10% favorability? Who are these people who "like" McDonald's? Does this reflect the success of social-media marketing in gaining new fans? Or are these 14 million hard-core fans who have always loved McDonald's and now take the opportunity to advertise their love to the world? Do we expect the act of "Liking" to generate additional revenue for McDonald's? If so, how? What a surprise - the number of "Likes" is insufficient to answer any of these questions.

Our mass media is a hysteria machine. In Chapter 5 of Numbers Rule Your World, I looked at the news coverage in the aftermath of jet crashes. What's happening in the wake of the cruise ship grounding in Italy is following the same pattern.

***

Here is an article talking about "10 other horrifying cruise ship disasters". (link) This is just like those tables of the 10 worst airline disasters in history. The #1 on the list is, unsurprisingly, the Titanic. That's really a laughable comparison: 1,500 dead and 700 survivors on the Titanic versus 4,000 survivors and 10-30 dead in the Italian grounding. Unless, that is, the point of this article is to tell us how much safer travel is today compared to a century ago.

***

Here's an industry insider advising readers "how to pick a cruise line for safety". (link) Given the scarcity of accidents (see previous post), it would seem pointless to spend a lot of time choosing between cruise lines specifically for safety.