Posts categorized "Story time"

In the previous post, I described how some researchers drew insights from a database of fatal car crashes. This dataset has all the markings of OCCAM data, the acronym I use to summarize the characteristics of today's data.

Observational

The data come from reports of crash fatalities, rather than from experiments, surveys, or other designed data collection methods.

No Controls

The database contains only the cases, i.e. the fatalities, but no controls, which in this case would be drivers who did not suffer fatalities. The study design creates a type of control but, as discussed in the previous post, the "controls" are still fatalities; they just happened during different weeks. Such a study design requires the untested assumption that, under normal circumstances, the frequency of fatalities is constant within the three-week window of the study.

Seemingly Complete

It is assumed that all crashes involving fatalities are reported accurately in the database. This assumption frequently turns out to be wrong once the analyst digs into the data. A recent example is the Tesla Autopilot analysis: even though in theory Tesla should have data on all its vehicles, the spreadsheet contained a large number of missing values.

Adapted

The fatality data are collected for a number of uses, none of which is to investigate the potential effect of 420 Cannabis Day. Adapted data are sometimes called found data or data exhaust.

Merged

For this analysis, the researchers did not merge datasets. Most of the time, they do. For example, one of the commenters suggests looking at the effect of temperature. To do that requires merging local temperature data with the fatality data. Merging data creates all kinds of potential data quality issues.

***

In this post, we shall set aside the conclusion of the previous post (that April 20 may not be extraordinary) and accept that April 20 is an unusual day.

The first question to ask is: unusual in what way?

Let's look at the histogram again:

April 20 is unusual in having a higher number of fatal car crashes compared to the average of April 13 and 27.

That is what we learned from the data. Our next question is: why is April 20 worse?

According to the original study, the reason for the excess fatalities is excess cannabis consumption on April 20 because 420 is cannabis celebration day.

But at this point, we only have story time. Story time is the spinning of grand stories based on tiny morsels of data. The moment hits you in the second half of a newspaper article or research report after the author presents the data analyses, when you realize that story-telling has begun, and the report strays far from the evidence.

In this case, it's the link between excess fatalities and excess cannabis consumption that is tenuous. The problem goes back to OCCAM data and the lack of proper controls. If we could perform an experiment, the evidence could be interpreted much more directly.

The database of fatalities does not contain data on cannabis consumption. The original study has some information from a "drug police report" field, with over 60 percent of the cases listed as "not tested or not reported". The researchers did not use this information to argue one way or the other about cannabis consumption.

The next step for this type of study is finding corroborating evidence to support the causal story. For example, are more of these accidents occurring in neighborhoods where 420 Day is celebrated? Can we find neighborhoods that only started celebrating 420 Day after a certain year, and look at whether a jump in crash fatalities occurred after that year? Do people drive more or less frequently after they smoke weed? Are there proxies for cannabis consumption (for example, maybe cannabis users are more likely to drive certain cars)? And so on.

Harper and Palayew looked into whether the crash ratio got worse over time, as it should have if cannabis consumption has been increasing over time. They found no such trend, which weakens the conclusion.

And if you're wondering about the acronym, it's Driving Under the Influence of Weed on 420 Day, which, I learned from Andrew Gelman's blog, is a day of celebration of cannabis.

Andrew's blog post is about the exemplary work done by Sam Harper and Adam Palayew, debunking a highly publicized JAMA study that claimed that 420 Day is responsible for a 12 percent increase in fatal car crashes.

The discussion provides great fodder for examining how to investigate observational data, which is what most of Big Data is about. It is a cautionary tale for what not to do.

***

The blog begins with Harper/Palayew channeling Staples/Redelmeier, the authors of the study: "fatal motor vehicle crashes increase by 12% after 4:20 pm on April 20th (an annual cannabis celebration)."

This short sentence captures the gist of the original study but it omits an important detail: to what is the increase relative?

If we ran an experiment, we would recruit a group of drivers, and select half of them at random to smoke weed on April 20. Then, we would count what proportion of drivers suffered fatal car crashes after 4:20 pm. The analysis would be straightforward: what's the difference in proportions between the two groups? With such an experiment, it is possible to draw a causal conclusion.

Alternatively, we could conduct a case-control study. The cases are the drivers who suffered fatal car crashes on April 20. We collect demographic data on these drivers. Then, we define a set of "controls", drivers who did not suffer car crashes on April 20 but on average, have the same demographic characteristics as the cases. Next, we need data on cannabis consumption, preferably on April 20. We want to show that the level of cannabis consumption is significantly higher for cases than for controls.

(For further discussion of these analysis designs, see Chapter 2 of Numbers Rule Your World (link).)
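
To make the contrast concrete, here is a minimal sketch of how the experimental comparison could be analyzed. The group sizes and crash counts are invented; no such experiment exists, and the actual study did nothing of the sort.

```python
# Hypothetical two-group comparison for the experiment described above.
# All numbers are made up for illustration.
from statsmodels.stats.proportion import proportions_ztest

crashes = [12, 7]          # fatal crashes after 4:20 pm: weed group, control group
drivers = [10000, 10000]   # drivers randomized to each group

diff = crashes[0] / drivers[0] - crashes[1] / drivers[1]
zstat, pvalue = proportions_ztest(crashes, drivers)
print(f"difference in proportions: {diff:.4%}")
print(f"z = {zstat:.2f}, p = {pvalue:.3f}")
```

Because assignment is randomized, the difference in proportions can be read causally; that is precisely what the adapted data cannot deliver.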

The actual study was neither experiment nor case-control. It was a piece of pure data analysis, based on "found data". I like to call this "adapted data," the "A" in my OCCAM framework for Big Data - data collected for other purposes that the researcher has adapted for his/her own objectives. In this study, the adapted data come from a database of fatal car crashes.

So how was the adapted data analyzed? Harper/Palayew answer this question in their second description of the research:

Over 25 years from 1992-2016, excess cannabis consumption after 4:20 pm on 4/20 increased fatal traffic crashes by 12% relative to fatal crashes that occurred one week before and one week after.

The cases are the fatal car crashes that occurred after 4:20 pm on 420 Day. The comparison isn't to the drivers who did not suffer crashes on the same day. The reference group consisted of fatal car crashes that occurred after 4:20 pm on 4/13 and 4/27. The difference in the average number of crashes is taken to result from "excess cannabis consumption".

Notice that such a conclusion requires a strong assumption. We must believe that absent 420 Day, 4/13, 4/20 and 4/27 ought to have the same fatal crash frequencies.

***

You hopefully recognize that the analysis design for adapted data is on much shakier ground than either an experiment or a case-control study.

Harper/Palayew's initial debunking focused on one issue: what's so special about April 20? To answer that, they repeated the same analysis on every day of the year. The following pretty chart summarizes their finding:

The red line is the line of no difference (between the analyzed day and the two reference days from the week before/after). Each vertical line shows the range of the estimated difference for a specific day of the year. The range for 4/20 is highlighted, and several other days with elevated fatal crash counts are labeled.

The chart was originally published here, with the following commentary: "There is quite a lot of noise in these daily crash rate ratios, and few that appear reliably above or below the rates +/- one week." Andrew adds: "Nothing so exciting is happening on 20 Apr, which makes sense given that total accident rates are affected by so many things, with cannabis consumption being a very small part."

While the chart looks cool and sophisticated, the following histogram of the same data helps the reader digest the information.

I took the daily estimates of the fatal crash ratios from Harper/Palayew's published data. Each ratio presents the crashes on the analysis day relative to the crashes on the two reference days. The histogram shows the day-to-day variability of the crash ratios, which is what we need to answer the question: how special is 4/20?

The histogram is roughly centered at 1.0, meaning no observed difference. The black vertical line shows the ratio for 4/20. It leans to the right - in fact, it sits at the 94th percentile. In classical terms, that corresponds to a p-value of 0.06, which just misses the conventional threshold for significance.
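
Here is a rough sketch of how one can place 4/20 within the distribution of daily ratios. The file and column names are hypothetical stand-ins for Harper/Palayew's published estimates.

```python
# Sketch: where does 4/20 sit among all the daily crash ratios?
# Assumes one row per calendar day with columns "day" and "ratio";
# the file and column names are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

ratios = pd.read_csv("daily_crash_ratios.csv")
apr20 = ratios.loc[ratios["day"] == "04-20", "ratio"].iloc[0]

percentile = (ratios["ratio"] < apr20).mean() * 100
print(f"4/20 sits at roughly the {percentile:.0f}th percentile of daily ratios")

ratios["ratio"].hist(bins=30)
plt.axvline(apr20, color="black")              # the 4/20 ratio
plt.axvline(1.0, color="red", linestyle="--")  # line of no difference
plt.xlabel("crash ratio (analysis day vs. reference days)")
plt.show()
```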

Will JAMA editors accept one research paper for each of these days? The work is already done - the rest is story time.

P.S. [4/27/2019] Replaced the first chart with a newer version from Harper's site. This version contains the point estimates that the other version did not. Those point estimates are used to generate the histogram.

Seamless, the online restaurant delivery service, has been running a series of fun, statistics-themed ads on the New York subway. Here is a snapshot of one of them:

The text on the ad says:

The Most Potassium-Rich Neighborhood

MURRAY HILL

Based on the Number of Banana Orders

No One’s Cramping Here

***

This ad is tongue-in-cheek. But it's making a data-driven argument. So I started unpacking it.

The conclusion is “No one’s cramping here (in Murray Hill).” It’s an exaggeration so I’m going to read this as “Most people don’t cramp here in Murray Hill.”

The data behind this conclusion is much harder to nail down. One would think it should be the proportion of orders containing bananas in Murray Hill relative to the same in other neighborhoods. The ad uses the phrase “number of banana orders.” What does that mean? Is it “orders with at least one banana”? Or “orders of bananas only”? Or “total number of bananas ordered (across all orders)”?
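
The three readings lead to three different counts. Here is a toy sketch with invented data (Seamless's actual schema is unknown to me):

```python
# Toy illustration: three ways to count "banana orders".
# The data frame below is invented, not Seamless's data.
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 1, 2, 3, 3],
    "item":     ["banana", "bagel", "banana", "banana", "banana"],
    "quantity": [2, 1, 1, 3, 1],
})

bananas = orders[orders["item"] == "banana"]

with_banana = bananas["order_id"].nunique()      # orders with at least one banana
banana_only = (orders.groupby("order_id")["item"]
                     .apply(lambda items: set(items) == {"banana"})).sum()
total_bananas = bananas["quantity"].sum()        # bananas across all orders

print(with_banana, banana_only, total_bananas)   # 3, 2, 7
```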

Between the data and the conclusion is a long, winding path. Let me draw this out:

Assumption 1: All the neighborhoods have similar total populations, so that by proportion of banana orders, Murray Hill also ranks #1.

Assumption 2: “Banana orders” is defined meaningfully. For the sake of argument, we’ll assume a banana order is an order that contains at least one banana.

Assumption 3: The data analyst used the appropriate address data. For the sake of argument, we'll assume that the delivery address is the source of the neighborhood data.

Assumption 4: Everyone who has a “banana order” through Seamless lives in the neighborhood to which the banana(s) were delivered. This further requires

Assumption 5: Everyone who has a “banana order” through Seamless works in the same neighborhood as they live. This distinction is important for daytime orders.

Assumption 6: Murray Hill residents who have a “banana order” through Seamless are just like other Murray Hill residents.

Assumption 7: The name on each “banana order” is the one person who consumes the banana(s). No dogs ate the bananas, nor did a co-worker, family member, or anyone else not known to Seamless.

Assumption 9: Published scientific reports reach a strong consensus on the effect of bananas on cramping (highly unlikely); or, Seamless data show that those with a “banana order” report the absence of cramps (which requires primary research). The causal interpretation further requires

Assumption 10: Knowing that the people who made “banana orders” through Seamless would have suffered cramps had they not ordered and consumed those bananas. This counterfactual scenario is never observed, so instead, we accept

Assumption 10b: Knowing that the people who did not make a “banana order” through Seamless did suffer cramps. This requires

Assumption 11: The people who live in Murray Hill and did not make a “banana order” through Seamless also did not order bananas from a different shop, or otherwise consume bananas. In addition, we require

Assumption 12: No one who is part of this analysis benefited from any other anti-cramping remedy; or, at the minimum,

Assumption 13: That people who have “banana orders” through Seamless, and those who don’t, are equally likely to have used other forms of anti-cramping remedy.

Assumption 14: One banana is effective at stopping cramps, meaning there is no dose-response effect, the presence of which would require us to define “banana order” differently under Assumption 2.

The above assumptions fall into three groups: obviously false (e.g. Assumption 1); possibly true; and most likely true. The probability of the conclusion depends on the probabilities of these individual assumptions.
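
A back-of-the-envelope calculation shows how quickly assumptions compound. The probabilities below are invented purely for illustration; the point is that even generously likely assumptions multiply down to a shaky conclusion.

```python
# If the conclusion needs every one of the fourteen assumptions above to hold,
# and (generously, and assuming independence) each holds with probability 0.9,
# the chance that all of them hold is already small.
n_assumptions = 14
p_each = 0.9
print(round(p_each ** n_assumptions, 2))   # about 0.23
```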

***

tl;dr

Most data-driven arguments consist of one part data, and many parts assumptions. An analyst should not fear making assumptions. Assumptions should be supported as much as possible.

In a mailing list I subscribe to, some users were not happy about the academic research using Dropbox data collected on academics, as written up by the co-authors in Harvard Business Review (link).

In a nutshell, the researchers obtained "anonymized" "project-folder-related" data from Dropbox on university-affiliated accounts, and did some simplistic bivariate correlations, and proceeded to draw several conclusions about "best practices" on "successful team collaborations." This type of research is very common in this "Big Data" age, and I have already written extensively about its many challenges.

OCCAM Data

This Dropbox dataset has all five of the OCCAM characteristics of "Big Data". It is Observational, seemingly Complete, with no Controls, Adapted from its original use, and Merged (with data from Web of Science). These characteristics cause many problems with the analysis, which I describe below. For more on OCCAM data, check out this post, and my other posts on OCCAM data.

Found Data

Implicit in their analysis - and most other uses of "found data" - is the assumption of complete data. The authors believe that because their data consist of tens of thousands of researchers and 500,000 projects, they must have all the data. In this case, the authors knew that there are other platforms out there but they waved away the inconvenience. This implies they believe they have all of the "informative" data.

The authors also assumed that (a) all relevant collaborative research involves putting all relevant files on one of the major online platforms (e.g. nothing on Slack or in emails) and (b) all project collaborators use one and only one platform. Further, they assume that everyone keeps highly organized and structured folder directories within Dropbox, from which an external person who knows nothing about the projects, or a machine, can infer their contents. These problems arise because the researchers did not start with a research question and design the data collection. They chose to adapt "found data" to their own objectives.

Ecological Fallacy, and Story Time

A typical conclusion is "People at higher-performing universities seemed to share work more equally, based on the frequency of instances that collaborators accessed project folders." Top universities are ranked by the aggregate performance of all teams (that use Dropbox, and are identified correctly). It does not follow that every team at a top university is a top team.

This is an instance of "story time." A piece of data is offered about something related, then while the reader is dozing off, it is linked to a conclusion that is not directly supported by the data. That conclusion is elaborated by a lot of words hypothesizing why it must be true. In this case, they say "It’s likely that more frequent collaborations led to positive spillover of information, insights, and team dynamics from one project to another." But they provide no evidence at all for this last statement. That's just a story.

Proxy Unmasking

Here's another one of the conclusions: "People at higher-performing universities seemed to share work more equally, based on the frequency of instances that collaborators accessed project folders." They draw a conclusion about how work is allocated among collaborators, but what they actually measured was the relative frequency at which collaborators accessed the project folders in Dropbox. That's a proxy measure; it's a convenient one given the "found data", but not a good one.

Xyopia

It does not appear that a multiple regression model was run. The presentation walks through what appears to be a series of bivariate analyses. The word "control" does not appear anywhere in the article. So this work suffers from xyopia - in each analysis, the one explanatory variable being analyzed is presumed to be the chief, and only, variable that influences the outcome.
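
To see the difference xyopia makes, here is a hedged sketch with simulated data; the variable names are mine, not the authors', and the numbers are made up. The apparent effect of one variable can shrink dramatically once a single control is added.

```python
# Sketch: a bivariate "effect" versus the same effect with one control added.
# All data are simulated; this is not the authors' analysis.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
team_size = rng.normal(0, 1, n)                    # hypothetical confounder
accesses = team_size + rng.normal(0, 1, n)         # the variable being analyzed
performance = 2 * team_size + rng.normal(0, 1, n)  # outcome driven by the confounder

df = pd.DataFrame({"accesses": accesses, "size": team_size, "perf": performance})

bivariate = sm.OLS(df["perf"], sm.add_constant(df[["accesses"]])).fit()
controlled = sm.OLS(df["perf"], sm.add_constant(df[["accesses", "size"]])).fit()

print(bivariate.params["accesses"])    # looks like a sizeable "effect"
print(controlled.params["accesses"])   # shrinks toward zero once size is controlled
```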

Causation Creep, and More Story Time

The authors made no attempt to establish causality at all. They just interpreted every correlation as causal. So every conclusion is "story time". They print one analysis of the data, then they draw a causal conclusion that one would believe only when half-asleep.

***

People are also upset about data privacy.

It does not appear that the academic users understood that using Dropbox means they are part of research studies.

People don't believe the data are truly anonymized. It's pretty clear that the anonymization can easily be reversed. Take the HBR article as an example: if the authors' names were removed but they were described as one junior faculty member at Northwestern Univ. Business School, one senior faculty member at Northwestern Univ. Business School, and one employee of Dropbox, I don't think you could find another article that fits those criteria. So is that anonymous?

It's unclear how they "anonymized" the folders, or how they analyzed them. There are folders with highly descriptive names, folders with partially descriptive names that only the collaborators may be able to decipher, and folders with names that do not identify the project (e.g. old_work). If they converted all folder names to alphanumeric strings, then all information about the contents of the folders is lost. If they didn't convert those names, then there are clear privacy concerns.

It's clear that some kind of IRB review is necessary to approve Big Data research projects to make sure privacy is protected.

One of my favorite statistics-related wisecracks is: the plural of anecdote is not data.

In today's world, the saying should really say: the plural of anecdote is not BIG DATA.

In class this week, we discussed a recent Letter to the Editor of the New England Journal of Medicine, a top journal, featuring a short analysis of weight data from a digital scale that, you guessed it, makes users consent to becoming research subjects by accepting its Terms and Conditions. (link to NEJM paper, covered by New York Times)

The "analysis" is succinctly summarized by this chart:

Their conclusion is that people gain weight around the major holidays.

How did the researchers come up with such a conclusion? They in essence took the data from the Withings scales, removed a lot of the data based on various criteria (explained in this Supplement), and plotted the average weight changes over time. Ok, ok, I hear the complaint that I'm oversimplifying. They also smoothed (and interpolated) the time series and "de-trended" the data by subtracting a "linear trend". The de-trending accomplished nothing, as evidenced by comparing the de-trended chart in the main article to the unadjusted chart in the Supplement.
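
For reference, "de-trending" of this sort just means subtracting a fitted straight line from the series. A minimal sketch with simulated data (not the Withings data):

```python
# Minimal sketch of de-trending by subtracting a linear trend.
# The daily weight-change series is simulated, not the Withings data.
import numpy as np

days = np.arange(365)
weight_change = 0.0002 * days + 0.4 * np.sin(2 * np.pi * days / 365)  # fake series

slope, intercept = np.polyfit(days, weight_change, 1)    # fit a straight line
detrended = weight_change - (slope * days + intercept)   # subtract it

# When the fitted slope is small relative to the seasonal swings, the
# de-trended chart looks almost identical to the raw one.
print(slope * 365)   # total drift removed over the year
```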

Then, the researchers marked out several major holidays - New Year, Christmas, Thanksgiving (U.S.), Easter, and Golden Week (Japan) - and lo and behold, in each case the holidays coincided with a spike in weight gain, ranging from a high of about +0.8% (U.S.) to a low of +0.25% (Easter).

Each peak is an anecdote and the plural of these peaks is BIG DATA!

Why did I say that? Look for July 4th, another important holiday in the States. If this "analysis" is to be believed, July 4th is not a major holiday in the U.S. On average, people tend to lose weight (-0.1%) around Independence Day. There is also no weight change around Labor Day.

In a sense, this chart shows the power of data visualization to shape perception. Labeling those five holidays draws the reader's attention; not labeling the other major holidays takes them out of the narrative. Part of having numbersense is having the ability and confidence to make our own judgment about the data. Once one notices the glaring problems around July 4th and Labor Day, one can no longer believe the conclusion.

There is also "story time" operating here. The researchers only had data on weight changes. They did not have, nor did they seek, data on food intake. But the whole story is about festive holidays leading to "increased intake of favorite foods" which leads to weight gain. Story time is when you lull readers with a little bit of data, and when they are dozing off, you feed them a huge dose of narrative going much beyond the data.

The real problem here relates to the research process. Traditionally, you come up with a hypothesis, and design an experiment or study to verify the hypothesis. Nowadays, you start with some found data, you look at the data, you notice some features in the data (like the five peaks), you then create your hypothesis, and there really is little need to confirm it since the hypothesis was suggested by the data. And yet, researchers will still run a t-test and report p-values (in this weight-change study, the p-values were < 0.005).

Even if it's acceptable to form your hypothesis after peeking at the data, the researcher should then have formulated a regression model with all of the major holidays represented; the model would then provide estimates of the direction and magnitude of each holiday's effect, along with its statistical significance.
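
A hedged sketch of what such a regression might look like, using placeholder dates and simulated data rather than the Withings data:

```python
# Sketch: regress daily weight change on indicator variables for all the
# major holidays at once. Dates and data below are placeholders.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

dates = pd.date_range("2013-01-01", "2013-12-31", freq="D")
df = pd.DataFrame({
    "date": dates,
    "weight_change": np.random.default_rng(1).normal(0, 0.3, len(dates)),
})

holidays = {"new_year": "2013-01-01", "easter": "2013-03-31", "july4": "2013-07-04",
            "thanksgiving": "2013-11-28", "christmas": "2013-12-25"}

# flag a window of a few days around each holiday
for name, day in holidays.items():
    day = pd.Timestamp(day)
    df[name] = df["date"].between(day - pd.Timedelta("3D"), day + pd.Timedelta("3D")).astype(int)

model = smf.ols("weight_change ~ new_year + easter + july4 + thanksgiving + christmas",
                data=df).fit()
print(model.params)    # direction and magnitude of each holiday's effect
print(model.pvalues)   # and its statistical significance, all from one model
```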

PS. Some will grumble that the analysis is not "big data" since it does not contain a gazillion rows of data, far from it. However, almost all Big Data analyses follow the blueprint outlined above. Also, I do not define Big Data by its volume. Here is a primer to the OCCAM definition of Big Data. Under the OCCAM framework, the Withings scale data is observational, has no controls, is treated as "complete" by the researchers, and was collected primarily for non-research reasons.

PPS. Those p-values are hilariously tiny. The p-value is a measure of the signal-to-noise ratio in the data. The noise in this dataset is very high. In the Supplement, the researchers outlined an outlier removal procedure, in which they disclosed that the "allowable variation" is 3% daily plus an extra 0.1% for each following day between two observations. Recall the "signals" had sizes of less than 0.8%.

Theranos (v): to spin stories that appeal to data while not presenting any data

To be Theranosed is to fall for scammers who tell stories appealing to data but do not present any actual data. This is worse than story time, in which the storyteller starts out with real data but veers off mid-stream into unsubstantiated froth, hoping you and I got carried away by the narrative flow.

Theranos (n): From 2003 to 2016, a company in Palo Alto, Calif., the epicenter of venture capital, founded by Elizabeth Holmes, a 19-year-old Stanford University dropout, raised over $70 million to develop and market a "revolutionary" blood-testing technology said to require only a finger-prick of blood. The company grew its valuation to $9 billion without ever publishing any scientific data in a peer-reviewed medical journal. It turned out that the new technology was used in only 12 of the 200 tests on its menu, meaning that the business was based on selling old technology at bargain-basement prices subsidized by the venture-capital money. Further, it emerged that the new technology was not accurate, that it has been shelved since last year, and that in some cases when the old technology was used, lab personnel improperly handled the machines--all of which eventually led to a blanket retraction of two full years' worth of test results. The company claimed that these results have been "corrected" in the last few weeks; it is unclear what "correction" means when the blood was taken from patients up to two years ago. The company is still in business, and Walgreens, one of its most prominent partners, continues its commercial relationship with the company. For many years, the business and technology press issued countless glowing reviews of the company (see this epic list, covering just 2013-2015). Until 2014, the company's board consisted entirely of politicians, former cabinet members, and military leaders. All of these individuals have been Theranosed.

The Wall Street Journal has done an exemplary job following this case, and deserves a Pulitzer for this effort. The latest revelation relating to the full-scale retraction is here.

Harvard Business Review devotes a long article to customer data privacy in the May issue (link). The article raises important issues, such as the low degree of knowledge about what data are being collected and traded, the value people place on their data privacy, and so on. In a separate post, I will discuss why I don't think the recommendations issued by the authors will resolve the issues they raised. In this post, I focus my comments on an instance of "story time", some questions about the underlying survey, and thoughts about the endowment effect.

***

Much of the power of this article comes from its reliance on survey data. The main survey used here is one conducted in 2014 by frog, the "global product strategy and design agency" that employs the authors. They "surveyed 900 people in five countries -- the United States, the United Kingdom, Germany, China, and India -- whose demographic mix represented the general online population". (At other points in the article, the authors reference other surveys, although none is explicitly described other than this one.)

Story time is the moment in a report on data analysis when the author deftly moves from reporting a finding of the data to telling stories based on assumptions that do not come from the data. Some degree of story-telling is required in any data analysis, so readers must be alert to when "story time" begins. Conclusions based on data carry different weight from stories based on assumptions. In the HBR article, story time begins just below the large graphic titled "Putting a Price on Data".

The graphic presented the authors' computation of how much people in the five nations value their privacy. They remarked that the valuations have very high variance. Then they said:

We don't believe this spectrum represents a "maturity model," in which attitudes in a country predictably shift in a given direction over time (say, from less privacy conscious to more). Rather, our findings reflect fundamental dissimilarities among cultures. The cultures of India and China, for example, are considered more hierarchical and collectivist, while Germany, the United States and the United Kingdom are more individualistic, which may account for their citizens' stronger feelings about personal information.

Their theory that there are cultural causes for differential valuation may or may not be right. The maturity model may or may not be right. Their survey data do not suggest that there is a cultural basis for the observed gap. This is classic "story time."

***

I wonder if the HBR editors reviewed the full survey results. As a statistician, I think the authors did not disclose enough details about how their survey was conducted. There are lots of known unknowns: we don't know the margins of error on anything, we don't know the statistical significance on anything, we don't know whether the survey was online or not, we don't know how most of the questions were phrased, and we don't know how respondents were selected.

What we do know about the survey raises questions. Nine hundred respondents spread over five countries is a tiny poll. Gallup surveys 1,000 people in the U.S. alone. If the 900 were spread evenly across the five countries, the survey had fewer than 200 respondents per country. A rough calculation gives a margin of error of at least plus/minus 7 percent. If the sample is proportional to population size, then the margin of error for a smaller country like the U.K. will be even wider.
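
Here is the rough calculation, for anyone who wants to check it; it assumes a simple random sample, which is itself a generous assumption.

```python
# Rough 95% margin of error for a proportion from a simple random sample.
# 900 respondents split evenly over five countries = 180 per country.
import math

n = 180
moe = 1.96 * math.sqrt(0.25 / n)   # worst case, p = 0.5
print(f"+/- {moe:.1%}")            # about +/- 7.3%
```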

The authors also claim that their sample is representative of the "demographic mix" of the "general online population." This is hard to believe since they have no one from South America, Africa, the Middle East, Australia, and so on.

The graphic referenced above, "Putting a Price on Data," supposedly gives a dollar amount for the value of different types of data. Here is the top of the chart to give you an idea.

The article said "To see how much consumers valued their data, we did conjoint analysis to determine what amount survey participants would be willing to pay to protect different types of information." Maybe my readers can help me understand how conjoint analysis is utilized for this problem.

A typical usage of conjoint is for pricing new products. The product is decomposed into attributes so for example, the Apple Watch may be thought of as a bundle of fashion, thickness, accuracy of reported time, etc. Different watch prototypes are created based on bundling different amounts of those attributes. Then people are asked how much they are willing to pay for different prototypes. The goal is to put a value on the composite product, not the individual attributes.

***

Also interesting is the possibility of an "endowment effect" in the analysis of the value of privacy. We'd really need to know the exact questions that the survey respondents were asked to be sure. It seems that people were asked how much they would pay to protect their data, i.e. to acquire privacy. In this setting, you don't have privacy and you have to buy it. A different way of assessing the same issue is to ask how much money you would accept to sell your data; that is, you own your privacy to start with. The psychologist Daniel Kahneman and his associates pioneered research showing that the values obtained by these two methods are frequently far apart!

In a classic paper (1990), Kahneman et al. told one group of people that they had been gifted a mug, and asked how much money they would accept in exchange for it (the median was about $7). Another group of people were asked how much they were willing to pay to acquire a mug; the median was below $3.

Is this the reason why businesses keep telling the press we don't have privacy and we have to buy it? As opposed to we have privacy and we can sell it at the right price?

***

Despite my reservations, the HBR piece is well worth your time. It raises many issues about data collection that you should be paying attention to. Read the whole article here.

Are science journalists required to take one good statistics course? That is the question in my head when I read this Science Times article, titled "One Cup of Coffee Could Offset Three Drinks a Day" (link).

We are used to seeing rather tenuous conclusions such as "Four Cups of Coffee Reduces Your Risk of X". This headline takes it up another notch. A result is claimed about the substitution effect of two beverages. Such a result is highly unlikely to be obtained in the kind of observational studies used in nutrition research. And indeed, a glance at the source materials published by the World Cancer Research Fund (WCRF) confirms that they made no such claim.

The headline effect is pure imagination by the reporter, and a horrible misinterpretation of the report's conclusions. Here is a key table from the report:

The conclusions on alcoholic drinks and on coffee come from different underlying studies. Even if they had come from the same study, you cannot take different regression effects and stack them up. The effect of coffee is estimated for someone who is average on all other variables; the effect of alcohol is estimated for someone who is average on all other variables. The average person in the former case is not identical to the average person in the latter case. So if you add (or multiply, depending on your scale) the effects, the total effect is not well defined.

In addition, you can only add (or multiply) effects if you first demonstrate that the two factors do not interact. If there is interaction, the effect of alcohol is different for people who drink less coffee relative to those who drink more. The alcohol effect stated in the table above, as I already pointed out, is for an average coffee drinker. Conversely, the protective effect of coffee may well vary with alcohol consumption.
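
The interaction point can be written down directly. In a generic model with an interaction term (this is an illustration, not the WCRF's actual model), the "alcohol effect" is not a single number:

```python
# Generic illustration (not the WCRF model): with an interaction term,
#   risk = b0 + b1*alcohol + b2*coffee + b3*alcohol*coffee
# the marginal effect of alcohol is b1 + b3*coffee, which changes with coffee.
b1, b3 = 0.12, -0.04   # made-up coefficients
for coffee_cups in [0, 1, 3]:
    print(f"{coffee_cups} cups of coffee -> alcohol effect {b1 + b3 * coffee_cups:+.2f}")
```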

***

The reporter also misrepresented the nature of the analysis. We are told: "In the study of 8 million people, cancer risk increased when they consumed three drinks per day. However, the study also found that people who also drank coffee, offset some of the negative effects of alcohol."

The reporter made it sound like a gigantic randomized controlled study was conducted. This is a horrible misjudgment. WCRF did not do any study at all, and certainly no researcher asked anyone to drink specific amounts of alcohol or coffee. The worst is the comment on people who drank coffee as well as alcohol. I can't find a statement in the WCRF report about such people. It's simply made up based on the false logic described above.

***

At one level, the journalist misquoted a scientific report. At another level, the WCRF report is rather disappointing.

The authors of the executive summary repeatedly use the language of causation. For example, "There is strong evidence that being overweight or obese is a cause of liver cancer." Really? Show me the study that shows obesity "causes" liver cancer.

Take one of their most "convincing" findings: "Aflatoxins: Higher exposure to aflatoxins and consumption of aflatoxin-contaminated foods are convincing causes of liver cancer." The causation is purely an assumption of the panel who reviewed prior studies. In Section 7.1, readers learn that this cause-effect conclusion comes from "four nested case-control studies and cohort studies" for which "meta-analyses were not possible". So not a single randomized trial and no estimation of the pooled effect.

What is nicely done in the report is the inclusion of "mechanisms" which are speculative explanations for the claimed causal effects. It's great to have thought carefully about the biological mechanisms. Nevertheless, these sections are basically "story time" unless researchers succeed in establishing those unproven links.

This is part 3 of my response to Gelman's post about the DST/heart attacks study. The previous parts are here and here.

One of the keys to vetting any Big Data/OCCAM study is taking note of the decisions the researchers made in conducting the analysis. Most of these decisions involve subjective adjustments or unverifiable assumptions. Not that either of those things is inherently bad; indeed, any analysis one comes across is likely to rely on one or, more likely, both. As consumers of such analyses, we must be aware of what those decisions are.

The authors selected a period of time to study. For the research paper, this was January 2010 to September 2013. The database has existed since 1998, and it wasn't explained why the other years are irrelevant. Besides, in the poster presentation, the analysis was based on March 2010 to November 2012, a different but overlapping period. In any case, it is assumed that what happened during those months is representative.

Heart attack admissions are assumed to be a reliable indicator of heart attacks. (Now, it is true that in the publications, the researchers explain that they use admissions requiring PCI as a proxy for heart attacks but as per usual, the reporters drop the modifiers, thus becoming complicit in "story time": selling us one bill of goods (admissions) and then delivering another (heart attacks).)

What happened at "non-federal" hospitals is assumed to be the same as what happened at other hospitals.

What happened in Michigan is assumed to be representative of what happened in 47 other states. Also assumed is the lack of similar effect in the two states that do not change their clocks.

Cases of heart attack admission that did not result in PCI are not tracked by the data collector, and are assumed to be unimportant.

The data is assumed to be correct. Procedures to collect data and to define cases are assumed to be consistent across all participating hospitals.

The sample size is really small. There were a total of four Spring Forwards and three Fall Backs in the data.

What annoyed Andrew: no adjustments are made for multiple comparisons, which means they are assuming that the observed effect is not a random event. This is a strong assumption.

The effect of DST (if it exists) is assumed to be linear in the number of days since the DST time shift. In other words, patients admitted on the Tuesday after the spring time change are assumed to be twice as exposed to DST as patients admitted on the Monday. It's hard for me to get my head around this assumption.

If enough of these assumptions or modeling decisions bother you, you should ignore the study and move on.

***

This study, like many others, is a perfect illustration of story time. In such studies, the researchers present some data analyses that tie factor A with outcome B; frequently, neither factor A nor outcome B is directly measured so the researchers start to speculate about a web of causation. Sleepy readers may not realize that much of the discussion is pure speculation while the result from the data analysis is extremely limited.

Story time occurs right here:

Our study corroborates prior work showing that Monday carries the highest risk of AMI. This may be attributed to an abrupt change in the sleep–wake cycle and increased stress in relation to the start of a new work week.

The first sentence is based on their data but the second statement is pure speculation. There is absolutely nothing in this study to confirm or invalidate the claim that the sleep-wake cycle of the average Michigan resident who presents himself or herself to the hospital ward was disrupted, or that the stress experienced by said resident has increased.

***

The fallacy of causation creep shows up as well. The authors said, "Our data argue that DST could potentially accelerate events that were likely to occur in particularly vulnerable patients and does not impact overall incidence."

If the DST effect is merely a correlation, and not a cause, it would not follow that by changing DST, one can affect the outcome. The only way the above statement holds is if one interprets the correlation as causation. Their "data" have done no arguing; it is the humans who are making this claim.

***

For those mathematically inclined, here is the description of the statistical model used in estimating the "trend" of heart attacks (recall that the gap between the actual counts and this trend is claimed as the DST effect):

This model allowed for a cubic trend in numeric date as well as seasonal factors reflecting weekday (Monday–Friday), monthly (January–December) and yearly (2010–2013) effects. The model also adjusted for the additional hour on the day of each fall time change, as well as the loss of an hour on the day of spring time changes through the inclusion of an offset term.

... The impact of the spring and fall time changes on AMI incidence adjusting for seasonality and trend was assessed through the addition of indicator variables reflecting the days following spring and fall time changes as predictors to the initial trend/seasonality model.

I'm a bit confused by this description, which implies that the weeks of the DST time shifts are included in the model used to predict the trend and seasonality. When, subsequently, this model's prediction is compared to the actual admissions count in the week after the DST time shift to compute relative risk, aren't they just looking at the residuals of the model fit?
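
For concreteness, here is my rough reconstruction of the kind of model being described. It is a guess at the structure, with invented file and column names; it is not the authors' code, and it omits the offset for the gained or lost hour.

```python
# Rough reconstruction (not the authors' code): Poisson regression with a cubic
# trend, weekday/month/year factors, and indicators for days after time changes.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("daily_admissions.csv", parse_dates=["date"])  # hypothetical file
df["t"] = (df["date"] - df["date"].min()).dt.days
df["weekday"] = df["date"].dt.day_name()
df["month"] = df["date"].dt.month
df["year"] = df["date"].dt.year

model = smf.poisson(
    "admissions ~ t + I(t**2) + I(t**3) + C(weekday) + C(month) + C(year)"
    " + days_after_spring_shift + days_after_fall_shift",   # hypothetical indicator columns
    data=df,
).fit()
print(model.params.filter(like="shift"))   # the claimed DST effects live here
```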

Also conspicuously absent is any mention of a hospital effect or a geographical effect or a patient demography effect, all of which I'd think are possible predictors of heart-attack admissions.

Andrew Gelman discusses a paper and blog post by Ian Ayres on the Freakonomics blog. Their main result is summarized as:

We find that a ten percentage-point increase in state-level female sports participation generates a five to six percentage-point rise in the rate of female secularism, a five percentage-point increase in the proportion of women who are mothers, and a six percentage-point rise in the proportion of mothers who, at the time that they are interviewed, are single mothers.

Andrew finds these claims implausible; so do I.

Ayres uses the econometrics methodology called instrumental variables regression to support these claims. Since the data are observational, and, as Andrew pointed out, there wasn't even a period of time in which one could find exposed and unexposed populations (since the Title IX regulation was federal), one must treat such regression results with a heavy dose of skepticism.

It is useful to understand that causal claims are possible here only if we accept all the assumptions of the instrumental variables method.

Besides, plausibility is assisted by the ability to outline the causal pathways. It should be obvious that more females competing in college sports does not directly cause more females to become secular. The data on sports competition and on secularism come from different sources and this presents a hairy problem. The analysis would have been more convincing if it found that among the women who participated in college sports, more became secular; what the analysis linked was higher participation rate and higher secularism among all women in the state.

What is it about sports participation that would cause people to become secular? (The visual evidence from professional American sports would lead me to hypothesize the opposite--that sports participation may be associated with higher religiosity!) Is this specific to the female gender? Did male secularism increase as sports participation by men went up?

As Andrew pointed out, the magnitude of the estimated effect seems too large to believe. I'd prefer to see these effects reported at more realistic increments. A jump of 10% participation is very drastic. For example, according to the chart here (the one titled "a dramatic, 40-year rise"), the percent of women participating in high school sports has moved just 2 percent from 1995 to 2011.

***

Andrew is right that this is an instance of "story time". And we are not saying that statisticians should not tell stories. Story-telling is one of our responsibilities. What we want to see is a clear delineation of what is data-driven and what is theory (i.e., assumptions). The plausibility of a claim depends on the strength of the data, plus whether we believe the parts of the theory that are assumed.