Posts categorized "Current Affairs"

Nice article in the New York Times about the "overdiagnosis" problem in cancer screening. The particular case is thyroid cancer in South Korea.

There are a number of things about any form of screening tests that one should always bear in mind:

The death rate is measured as the number of deaths divided by the number of people diagnosed with the disease. The denominator grows as diagnostic techniques improve.

Better diagnostic techniques for cancer inevitably identify tiny tumors, almost all of which would never have threatened anyone's life. Peggy Orenstein had a fabulous article a year ago about the same issue in breast cancer. In that case, those tiny tumors weren't even called cancer until the early-detection movement took hold.

Once a tumor is labeled cancerous, patients will opt to fix it. Because these tumors would never have killed the patients, they inflate the number of diagnosed cases without increasing the number of deaths. So just by virtue of increased diagnoses, the death rate is brought down.
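To make the arithmetic concrete, here is a minimal sketch with made-up numbers showing how the death rate falls purely because the denominator swells:

```python
# Hypothetical numbers, for illustration only.
deaths = 400                  # annual deaths from the cancer (unchanged by screening)
cases_before = 2_000          # diagnosed cases before mass screening
cases_after = 20_000          # diagnosed cases after screening picks up tiny tumors

print(f"Death rate before screening: {deaths / cases_before:.1%}")  # 20.0%
print(f"Death rate after screening:  {deaths / cases_after:.1%}")   #  2.0%
# The measured death rate drops tenfold even though not a single death was prevented.
```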

The immediate outcome of a nationwide screening program is a dramatic increase in diagnosed cases. The article is a little unclear about whether it is the number or the rate of deaths that did not fall. It doesn't really matter; either way, we conclude that the screening has failed to improve health.

One must not forget that screening tests, subsequent confirmatory tests, treatments, etc. all cost money, so there is a financial incentive to over-diagnose and over-treat.

In addition to the financial incentive, there is the issue that I raised in Chapter 4 of Numbers Rule Your World (link). A false negative is a very public error on the part of the medical establishment, while a false positive (followed by, say, removal of the thyroid) is an unobservable error. So there is a statistical incentive to over-diagnose and over-treat.

In his new NYT column titled "Death by Data" (link), David Brooks disparaged the recently celebrated practice of using machine learning in electoral politics, such as trying to win elections "Obama-style" by targeting resources at the people most likely to listen to the candidate's message, and trying to craft electoral messages by testing and measuring how people react to certain words and phrases.

Electoral politics is another success story often cited by Big Data people.

Brooks said a few things that pinpoint one consequence of how Big Data is being used today. Here are some nice quotes:

"As politics has gotten more scientific, the campaigns have gotten worse, especially for the candidates who overrely on these techniques."

"Data-driven politics is built on a philosophy you might call Impersonalism. This is the belief that what matters in politics is the reaction of populations and not the idiosyncratic judgment, moral character or creativity of individuals."

"Data-driven politics assumes that demography is destiny, that the electorate is ... a collection of demographic slices."

"... it is more important to target your likely supporters than to try to reframe debates or persuade the whole country."

"It puts the spotlight on messaging and takes the spotlight off product: actual policies."

The question in my mind is whether these issues are caused by the data-driven philosophy of the analysts, as Brooks asserts, or by the win-at-all-cost philosophy of the politicians.

By changing the context, these statements also apply to business use of data. A lot of the machine learning models improve the numbers but not necessarily the user experience or customer satisfaction. The prevalence of tricks used to promote unintended clicks of display ads is a powerful reminder of Brooks's "Impersonalism" idea.

I'm preparing my talk next week at the Business Intelligence Innovation Summit in Chicago, which is titled "The Accountability Paradox in Big Data Marketing". More data has not made us more accountable, so far.

***

I'm also unhappy about cookie-cutter campaign speeches that are but a string of buzzwords proven by A/B testing to appeal to the electorate. But this is made worse by the politicians who are willing to utter these words brainlessly, by the politicians who are willing to discard their own beliefs in order to win elections, and by the electorate who take these politicians at their word.

As Brooks correctly diagnosed, by using OCCAM data (link), particularly observational data without controls, the analysts surface correlations and have nothing to say about causation. This leads to a situation where the models provide little if any information to the politicians about the desires or wishes or expectations of the electorate. All they know is that if they include the word "family" and exclude the word "fear", they may get a higher rating, and if they get higher ratings among persuadable segments of the likely voters, they may win an election.
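For readers who haven't seen what such message testing boils down to, here is a minimal sketch (with invented response counts) of the kind of A/B comparison involved; it ranks wordings by measured response and says nothing about why voters respond:

```python
# Hypothetical A/B test of two scripted messages, one using "family", one using "fear".
# The counts are invented. The test compares response rates; it offers no insight into
# the desires or expectations behind those responses.
from statsmodels.stats.proportion import proportions_ztest

favorable = [312, 268]    # respondents rating each variant favorably
shown = [1000, 1000]      # respondents shown each variant

stat, pvalue = proportions_ztest(favorable, shown)
print(f"z = {stat:.2f}, p = {pvalue:.4f}")
# A low p-value says the "family" variant polled better in this sample --
# a correlation between wording and rating, not a statement about causation.
```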

***

The second half of Brooks's article veers off course, displaying the arbitrary nature of the non-data-driven argument. He parades a list of past candidates and claims that their failures were due to over-reliance on data. Obama won the election but lacked any coherent agenda, for example. But there is no evidence to support the connection that subsequent failure was caused by bad use of data.

A lot of Big Data analyses default to analyzing count data, e.g. number of searches of certain keywords, number of page views, number of clicks, number of complaints, etc. Doing so throws away much useful information, and frequently leads to bad analyses.

***

I was reminded of the limitation of count data when writing about the following chart, which I praised on my sister blog as a good example of infographics, a genre chock-full of deplorable things.

On the other blog, I explained why I prefer to hide the actual numbers, from a dataviz perspective.

There is also a statistical reason for not drawing undue attention to the counts.

These counts do not indicate the severity of the injuries: some may have knocked the player out of the game, others may have been much milder. Some injuries are sustained by first-team players who spend a much longer time on the field than backups, so their higher counts may simply reflect greater exposure rather than a higher rate of injury.

Another statistical consideration is heterogeneity. I'd like to see a small-multiples version of this chart, with the data split by position on the field. I think it will be quite telling which body parts are hurt more depending on one's role in the game. Similarly, splitting by age, body size, and other factors will yield interesting insights.
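As a sketch of the adjustments I have in mind, suppose we had an injury table with one row per injury plus each player's position and minutes on the field (hypothetical columns, invented data); the rate adjustment and the split by position would look something like this:

```python
import pandas as pd

# Invented data: one row per injury.
injuries = pd.DataFrame({
    "player":         ["A", "A", "B", "C", "C", "D"],
    "position":       ["forward", "forward", "defender", "goalkeeper", "goalkeeper", "defender"],
    "body_part":      ["knee", "ankle", "hamstring", "shoulder", "knee", "ankle"],
    "minutes_played": [2700, 2700, 900, 3000, 3000, 450],
})

# Injuries per 1,000 minutes on the field, instead of raw counts.
per_player = (injuries.groupby("player")
              .agg(n_injuries=("body_part", "size"),
                   minutes=("minutes_played", "first")))
per_player["rate_per_1000_min"] = 1000 * per_player["n_injuries"] / per_player["minutes"]
print(per_player)

# The small-multiples idea: tabulate body part by position before plotting.
print(pd.crosstab(injuries["position"], injuries["body_part"]))
```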

***

At about the same time, I was reading the July issue of Significance magazine (an RSS and ASA publication). Here is the link (not free).

In an article about assessing whether iceberg risk was particularly high in the year of the Titanic, the authors quantified the risk in terms of "number of icebergs crossing latitude 48 N each year". It'd seem worthwhile to ask whether there is also a relevant size distribution.

Then, in an article about "black box modeling" (i.e. data mining) by Max Kuhn and Kjell Johnson, they invoke the FDA adverse event reporting database as an example of "events data". Events data is everywhere these days, and the most popular analyses of such data revolve around counting the number of adverse events. The severity and type of events are frequently ignored.

P.S. In their otherwise gung-ho article, Kuhn and Johnson also point to one of the biggest challenges of OCCAM data: "If there is a systematic bias in a small data set, there will be a systematic bias in a larger data set, if the source is the same." If one is analyzing the FDA adverse events database, one must hope to apply the learning to people who don't yet have adverse events, but such an analysis would be flawed since the database doesn't have any controls, i.e. people without these adverse reactions.

The New York Times Magazine has a pretty good piece about the use of OCCAM data to solve medical questions, like diagnosis and drug selection. I'm happy that it paints a balanced picture of both the promise and the pitfalls.

Here are some thoughts in my head as I read this piece:

Small samples coupled with small effects pose a design problem in traditional clinical trials. The subjects of the NYT article claim that OCCAM data can fill the void. If a treatment is highly effective, even small clinical trials will find the effect. So the underlying issue is less sample size than effect size.
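A standard power calculation makes the point; this is my own sketch, not anything from the article, using a range of assumed effect sizes:

```python
# Subjects needed per arm for a two-sample t-test at 80% power and 5% significance,
# across a range of assumed effect sizes (Cohen's d). Illustrative numbers only.
from statsmodels.stats.power import tt_ind_solve_power

for d in (0.1, 0.3, 0.8, 1.5):
    n = tt_ind_solve_power(effect_size=d, alpha=0.05, power=0.8)
    print(f"effect size {d:>3}: about {n:.0f} subjects per arm")
# A dramatic effect (d around 1.5) needs only a handful of subjects per arm;
# a small effect (d around 0.1) needs well over a thousand.
```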

Counterfactual evidence is almost always absent from OCCAM data because of lack of controls (the first “C” in OCCAM). The lede in the story concerns a girl who was given an anti-clotting drug because a doctor suspected she had elevated risk of blood clotting, and the girl did not develop a clot. Statisticians are not impressed by such evidence, because we don’t know whether the drug was truly responsible for the outcome. (It's a correlation until proven guilty.) If the girl had not taken the drug, would she have developed a clot? This point is argued in the article by Chris Longhurst: “At the end of the day, we don’t know whether it was the right decision.” This ignorance puts us in dangerous territory, making it a challenge to tell apart the prescient from the charlatan.

The Big Data world is filled with "events data". You have a log of everyone who clicked on a particular button, or a log of everyone who called your call center, etc. You only have the cases but not any non-cases (e.g. the unhappy customer who did not call the call center). Heartwarming stories like the girl's avoidance of clotting get repeated (or become viral, in modern terminology) but stories of failure are not usually deemed worth reporting. There are four possible stories, crossing whether the drug was given with whether a clot developed.

The media imposes a filter so that only the one story will get through. Without mentally accounting for the other stories, one can't judge how important the reported story is!
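A toy simulation, with invented numbers, shows how lopsided the picture becomes when only one of the four stories is told:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
gave_drug = rng.random(n) < 0.5
clot = rng.random(n) < 0.02          # assume, for argument's sake, the drug does nothing

reported = gave_drug & ~clot         # the only combination deemed newsworthy
print("Reported success stories:", reported.sum())
print("Untold stories:          ", (~reported).sum())
# Judged by the reported stories alone, the drug looks like a string of successes,
# even though by construction it has no effect at all.
```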

***

In the July issue of Significance, the magazine by RSS and ASA, Julian Champkin contributed a great profile of Iain Chalmers, the founder of the Cochrane Collaboration, the organization that aggregates and summarizes trial results. I saw this fantastic quote, which speaks to the New York Times article:

Dr. Spock’s 1946 book, Baby and Child Care, was ... read by a huge proportion of [parents around the world]; throughout its first 52 years in print, it outsold every other book except the Bible. “It recommended that babies should be laid to sleep on their stomachs. Now we know that doing that increases the risk of cot (crib) death. Tens of thousands of babies died needlessly because of that advice.”

I have earlier reported that Princeton's new President has initiated a review of the "grade deflation" policy that was put in place almost ten years ago. As you may recall (link), grading in U.S. colleges has become a farce: at top-tier schools, getting an A means you are an average student; not getting an A is many times more informative than getting an A.
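To put a rough number on that claim, here is a back-of-envelope calculation, assuming a hypothetical school where 80% of grades are As (information content is just the negative log of the probability):

```python
from math import log2

p_A = 0.80                       # hypothetical proportion of As
info_A = -log2(p_A)              # about 0.32 bits
info_not_A = -log2(1 - p_A)      # about 2.32 bits

print(f"Getting an A:     {info_A:.2f} bits")
print(f"Not getting an A: {info_not_A:.2f} bits")
# Under this distribution, a non-A carries roughly seven times the information of an A.
```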

The new administration at Princeton has now decided to abandon this cause, pending a faculty vote next month, and return to the old normal. This development is highly regrettable, and a failure of leadership. (The new policy leaves it to individual departments to do whatever they want.)

The recent Alumni publication has two articles about this topic, one penned by President Eisgruber himself. I'm not impressed by the level of reasoning and logic displayed here.

***

Eisgruber's piece is accompanied by a photo, captioned thus:

The goal of Princeton's grading policy is to provide students with meaningful feedback on their performance in courses and independent work.

Such a goal is far too vague to be practical. But let's take this vague policy at face value. How "meaningful" is this feedback when 40% of grades handed out are As, and 80% of grades are either As or Bs? (At Stanford, Harvard, etc., the distributions are even more skewed.)

This tortured logic seems to suggest that the deflated grades are somehow less "meaningful" than the inflated grades of the past. How so? (The grade-deflation policy sets a guideline limiting As to about 35%, apparently applied by department, and not strictly enforced.)

Eisgruber offered the following justification for rethinking the grade deflation policy:

Almost 10 years after its enactment, the policy remained a lightning rod of controversy and a considerable source of stress for many students, parents, alumni, and faculty members. And regrettably, none of our immediate peer institutions followed our example in taking tough measures to address grade inflation. As a result, Princeton, which ought to be renowned for the unsurpassed quality of its teaching, was attracting more attention for the severity of its curve.

Again:

The committee found no evidence that the grading policy hindered Princeton students' competitiveness in seeking postgraduate employment, fellowships, ... Perceptions of the policy, however, have been a very real source of stress for students, which concerned the committee.

None of this has anything to do with the meaning of grades. In fact, if stress is the primary concern, one might do away with grades altogether.

***

The committee also did its best to contort the interpretation of data. This is the chart showing the extent of grade inflation at Princeton (kudos to the administrators for making it public - looking at you, Stanford, Harvard, Yale, with your conspicuous silence):

A reasonable conclusion from the above chart is that the creation and enactment of the grade deflation policy has led to a shift of the distribution of grades from As to Bs. Another clear message is that the 35% target was not robustly applied as even after the policy came into place, the proportion of As stayed above 40 percent. (I am not sure how to explain the apparent plunge in students taking Pass/D/Fail courses around the same time. I'm guessing a separate policy change occurred at that time.)

Worryingly, this is not how Princeton views the chart. The committee members focused on the fact that the shift started "in advance of current policy's implementation in 2005", therefore "the numerical targets may have been only partially responsible for reversing a pattern of higher grading". They then claimed that this other cause is "sustained conversation about grading." Without any data, as far as I can tell, not only did they claim the existence of such an effect but they also asserted that this factor "may be as effective as numerical targets in keeping rising grades in check". In other words, based on the chart, they concluded that the change in proportion of As is caused by two equally important factors, the explicit policy to curb As, and "sustained conversation about grading".

In fact, the committee members seemed confused as they also said "the fraction of A grades... increased between 2009 and 2013 as monitoring of the policy grew more lax." What happened to the sustained conversation?

***

There are two central contradictions between the diagnosis and the proposed solution.

The committee made a case that changing grading policy requires cooperation. If only Princeton takes the lead, and no other peer institution follows, then Princeton is at a disadvantage. And yet, they are proposing to dismantle a coordinated school-wide policy in favor of each department doing its own thing. I'd argue that at the department level, the same dynamic applies. There is no incentive for any particular department to take the lead in curbing As. In fact, the natural equilibrium is for all departments to inflate grades. (Think prisoner's dilemma.)
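A toy payoff matrix, with invented payoffs, captures the dilemma: whatever the other department does, inflating looks better, so inflation is the equilibrium.

```python
# (dept1 action, dept2 action): (dept1 payoff, dept2 payoff) -- invented numbers.
payoffs = {
    ("deflate", "deflate"): (3, 3),   # grades are meaningful for everyone
    ("deflate", "inflate"): (0, 4),   # dept1's students look worse by comparison
    ("inflate", "deflate"): (4, 0),
    ("inflate", "inflate"): (1, 1),   # grades mean little, but no one is disadvantaged
}

for other in ("deflate", "inflate"):
    best = max(("deflate", "inflate"), key=lambda me: payoffs[(me, other)][0])
    print(f"If the other department chooses to {other}, the best response is to {best}.")
# Both lines print "inflate": the dominant strategy, hence the natural equilibrium.
```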

The other concern of the committee is the "stress" caused to everyone by the new policy. Presumably, such stress is what sustains a conversation about grading. I'd like to know how that conversation would persist once the policy is dismantled.

The final word appears to be a rejection of quantitative measurement. Here's Eisgruber:

The committee wisely said: If it's feedback that we care about, and differentiating between good and better and worse work, that's what we should focus on, not on numbers.

The wisdom has eluded me.

PS. [11/25/2014] Andrew Gelman and his readers have some good discussion about this post here.

Some behind-the-scenes comments on my recent article on New York's restaurant inspection grades; it appeared on FiveThirtyEight this Tuesday.

***

The Nature of Ratings

This article is about the ratings of things. I devoted a considerable amount of pages to this topic in Numbersense (link) - Chapter 1 is all about the US News ranking of schools. A few key points are:

All rating schemes are completely subjective.

There is no "correct" rating scheme, therefore no one can prove that their rating scheme is better than someone else's rating scheme.

A good rating scheme is one that has popular acceptance. If people don't trust a rating scheme, it won't be used. (This is a variant of George Box's quote: "all models are false but some are useful".)

Think of a rating scheme as a way to impose a structure on unwieldy data. It represents a point of view.

All rating schemes will be gamed to death, assuming the formulae are made public.

Based on that, you can expect that my goal in writing the 538 article is not to praise or damn the city's health rating scheme. My intention is to describe how the rating scheme works based on the outcomes. I want to give readers information to judge whether they like the rating scheme or not.

OCCAM Data

The restaurant grade dataset is an example of OCCAM data. It is Observational, it has no Controls, it has seemingly all the data (i.e. Complete), it will be Adapted for other uses and will be Merged with other data sets to generate "insights". In my article, I did not do A or M.

Hidden Biases in Observational Data

Each month (or week, check), the department puts up a dataset on the Open Data website. There is only one dataset available and the most recent copy replaces the previous week's dataset. The size of the dataset therefore expands over time.

Anyone who analyzes grade data up to the most recent few months is in for a nasty surprise. As the chart on the right shows, the proportion of grades that are not A, B or C (labeled O and gray) spikes to roughly 10 times the normal level during the last two months. This chart is for an August dataset, and is not an anomaly. It's an accurate description of the ongoing reality.

If a restaurant is given a B or C on its initial inspection, the restaurant has the right to go through a reinspection and arbitration process. During this time, the restaurant is allowed to display the "Grade Pending" sign. It appears that it can take up to four months for most of the B- or C-graded restaurants to finish this process. Over this period, many of the pending grades will flip to an A, B, or C. The chance that they will flip to B or C is much higher than for the average restaurant (i.e. one not known to have a Grade Pending).

Indeed, the proportion of As in the most recent two months is vastly biased upwards as a result of the lengthy reinspection process.

For this reason, I removed the last two months from my analysis.
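For those replicating the analysis, the cleaning step is a one-liner; this is a sketch assuming a DataFrame with hypothetical column names (not the actual Open Data schema):

```python
import pandas as pd

def drop_recent_months(df, months=2):
    """Drop inspections from the most recent months, where Grade Pending cases
    have not yet resolved and A grades are over-represented."""
    cutoff = df["grade_date"].max() - pd.DateOffset(months=months)
    return df[df["grade_date"] <= cutoff]

# Example with made-up records:
records = pd.DataFrame({
    "grade_date": pd.to_datetime(["2014-03-01", "2014-05-15", "2014-07-20", "2014-08-05"]),
    "grade": ["A", "B", "A", "A"],
})
print(drop_recent_months(records))   # keeps only the first two rows
```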

How might this bias affect your analysis?

If you drop all Pending grades from your analysis (while retaining the A, B, and C grades), you have created an artificial trend in the last two months.

If you keep the last available grade for each restaurant, you have not escaped the problem at all. In fact, you introduce yet another complication: B- and C- graded restaurants have older inspection dates than the A-graded restaurants. Meanwhile, those Pending grades are still dropped.

If you automatically port this data to a mapping tool, or similar, you are displaying the biased data and the unknowing users are misled. In fact, the visualization can no longer be interpreted.

IMPORTANT NOTE: The data is NOT WRONG. Data cleaning/pre-processing does not just mean finding bad data. Much of what statisticians do when they explore the data is to identify biases or other tricky features.

The Nature of Statistical Analysis

[Captain Hindsight here.] Of course, I didn't know or guess that the Grade Pending bias would be a problem. I did the first analysis of the data using a July dataset, and by the time I was drafting the article for FiveThirtyEight, it was already August so I "refreshed" the analysis with the latest dataset. That's when I noticed some discrepancies that led me to the Grade Pending issue.

This is the norm in statistical analysis. Every time you sit down to write something up, you notice additional nuances or nits. Sometimes, the problem is severe enough that you have to re-run everything. Other times, you just decide to gloss over it and move on.

Tom Davenport is one of the leading voices on business analytics, and he has a new piece titled "Why are most 'targeted' marketing offers so bad?" in which he expanded on a question I raised in my HBR article. Tom's book Competing on Analytics is a classic. He has a great appreciation for the business side of the data business.

In the new feature, Davenport classifies marketing offers he gets into five types:

(1) retargeted offers; (2) well-meaning but poorly-targeted offers; (3) offers that benefit the offerer rather than the potential consumer; (4) offers that are OK except for the context; and (5) well-targeted offers that benefit you

and he certainly speaks some truths.

On retargeted offers, he reminds marketers "for the most part, if we abandon a search or purchase, we intended to do so."

On well-meaning and poorly-targeted offers (like sending men offers for women's clothing), he suspects that the retailers didn't try hard enough to mine their data.

***

I think there are some technical deficiencies partially responsible for these issues.

Firstly, human behavior and preferences can never and will never be reduced to a set of equations. Thus, every targeting algorithm has to balance false positives and false negatives. I have written about this a lot. Start with Chapter 4 of Numbers Rule Your World or the Groupon and Target chapters in Numbersense.

Secondly, the existence of "retargeting" as a business is entirely due to a perversion of measurement, which I address in Chapters 1-2 of Numbersense. I also wrote about how online marketing is measured here. Briefly, the more you flood customers with impressions, the more likely one of your impressions lands close in time to a purchase event, and the more credit you get for "influence".
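A toy simulation makes the point; the numbers and the 24-hour attribution window are invented, and by construction the ads influence nothing:

```python
import numpy as np

rng = np.random.default_rng(1)
horizon = 30 * 24                                        # hours in a month
purchases = rng.uniform(0, horizon, size=1_000)          # purchase times, unaffected by ads

def credited(impressions_per_user, window_hours=24):
    impressions = rng.uniform(0, horizon, size=(1_000, impressions_per_user))
    gaps = purchases[:, None] - impressions              # time from impression to purchase
    return ((gaps > 0) & (gaps < window_hours)).any(axis=1).sum()

for k in (1, 5, 20, 100):
    print(f"{k:>3} impressions per user -> {credited(k)} purchases credited to 'influence'")
# More flooding, more credit -- even though the purchases were generated independently of the ads.
```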

Thirdly, the data is noisy and few are investing any time in getting rid of bad data. Just think about it for a second. Let's say you are a guy. If your son let his classmate use your iPad to buy something from a girls' clothing site just once, you are forever tagged as a buyer of girls' clothing.

The Big Data mindset for solving this problem is to get even creepier: to collect "all of your data". But if everything is tracked by hundreds or thousands of different entities, that wouldn't work either, so the Big Data end game is a single all-knowing monopolist of all of your data.

But this path is entirely a dead end. Here's something to ponder - the fact that you visited a particular website is today equated to an expression of interest in that website. The data measure what you do, not why you do it.

The solution is humility, and accepting a level of uncertainty. Enhance observed data with more direct, even qualitative data. Remove noise, which is a way of managing the uncertainty.

The statistics community loves to think of our subject as highly practical and relevant to the general population. And this is true.

The average person has a poor grasp of basic statistical thinking, even if he or she has taken one or more statistics courses. This is true, yet many in our community are in denial.

Chapter 1 of Numbers Rule Your World deals with the most basic of basics... the use of statistical averages and the variance around the average. If you're designing a first course in statistics, or a course in statistical literacy for non-majors, you'd hope that students would at least pick up this one concept.

***

Apparently, neither the reporter nor the editors at the LA Times have taken a basic statistics course, or, more likely, they did not learn about statistical averages and probability distributions well enough to apply them to real-world data.

The depressing article discusses a tabulation of the "richest" cities in the world, using the proportion of millionaires among residents as the metric. New York City is the "richest" in the U.S., boasting 389,000 millionaires, or 4.63% of its residents.

Here is how the reporter "enlivens" the boring statistic:

Walk down the street in New York and you're virtually guaranteed to see several millionaires. That's because more than 1 in every 25 New Yorkers is a millionaire, according to a study released Tuesday.

For this to be true, we have to assume that the population of New York is evenly distributed geographically. Further, we have to assume that millionaires are also evenly distributed. Both these assumptions are laughably bad. The statistical average is useless here when you start talking about "walking down every street".
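Even granting the uniform-mixing assumption, the "virtually guaranteed" claim needs a lot of passersby; here is a quick binomial calculation (my own sketch, not from the study):

```python
from scipy.stats import binom

p = 0.0463                             # proportion of millionaires, per the study
for n in (10, 50, 200):
    prob = 1 - binom.cdf(2, n, p)      # P(at least 3 millionaires among n random passersby)
    print(f"pass {n:>3} people: P(see 3 or more millionaires) = {prob:.2f}")
# You need to pass on the order of a couple hundred randomly mixed New Yorkers
# before "several millionaires" becomes near-certain -- and clustering by
# neighborhood only makes the claim weaker.
```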

Besides, what's the chance that a millionaire is walking down the street in Manhattan, let alone in Staten Island? Are they more likely found in a cab or limo or private car or helicopter?

The reporter should also have done homework on how the researcher classifies the residence of millionaires. Many of them have multiple homes, most likely in global locations. They may have an address in Manhattan, for example, but what if they spend most of their time in the Hamptons or Bermuda? You can count them as New York residents but surely you won't see them on the streets of New York!

***

When only a minority of students exhibit little to no ability to apply statistical thinking, one could blame the students; but when a majority of those who have taken statistics courses commit the most fundamental errors, one must blame the teaching.

Facebook data scientists are being blasted for a social psychology experiment they ran in 2012 in which they varied the amount of positive/negative content exposed to users in newsfeeds and measured whether this affected the positive/negative content posted by those users. (link to WSJ report; link to paper)

I'm perplexed by the reaction. Boing Boing's Cory Doctorow, linking to law professor James Grimmelmann, calls it "likely illegal". Slate slams it as "unethical". NPR proclaims that "we" are "lab rats, one and all".

***

About Consent

The biggest gripe is that users who were randomly selected to be part of the test were not informed. Facebook argues that consent is given in the overarching terms and conditions which all users must agree to.

I am against all forms of forced consent as practiced by Internet companies like Facebook and Google, but that's a much larger issue of which scientific experiments like this are but a small part. The same critics don't seem to mind if the experiments are conducted for financial gain (like generating advertising revenues), but they grumble about an academic exercise designed to verify a theory of social psychology.

Critics are emotionally charging the conversation by claiming that "Facebook is manipulating users' emotions". The truth is every form of marketing and advertising is a form of manipulating users' emotions. If you are against this particular experiment, you should be against all the other experiments (i.e. so-called A/B tests) that are conducted every day by the same companies on the same users who are not informed and have not agreed to be "lab rats".

In fact, the same critics should have been more incensed by those advertising experiments, for which the businesses are looking for "actionable" insights, meaning that they are looking for ways to manipulate not just your emotions but your actions and behavior, such as what you click on, what you view, and what you spend money on.

Scientific Progress

Playing the devil's advocate for the moment, I'd suggest that this type of large-scale randomized experiment has the potential to revolutionize psychology research.

I have never been a fan of the typical psychological experiment that we are forced to accept as legitimate "science": you know, those experiments in which a professor recruits 10 or 50 students via campus posters offering $20 for participation. The severe limitation of the sample, in both size and composition, does not stop researchers from generalizing the results to all humans. The standard claim is that the observed effect is so large as to obviate the need for a representative sample. Sorry - the bad news is that a huge effect for a tiny non-random segment of a large population can coexist with no effect for the entire population.
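A small numerical illustration of that last point, with invented numbers:

```python
segment_share = 0.01       # say, undergraduates who respond to campus posters
effect_in_segment = 0.50   # a "huge" effect observed in that segment
effect_elsewhere = 0.00    # no effect for everyone else

population_effect = (segment_share * effect_in_segment
                     + (1 - segment_share) * effect_elsewhere)
print(f"Effect in the studied segment:  {effect_in_segment:.2f}")
print(f"Effect in the whole population: {population_effect:.3f}")   # 0.005
# A huge effect in a tiny, unrepresentative segment dilutes to essentially nothing
# when generalized to the population.
```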

I believe scientists should be actively addressing the privacy and ethical concerns of such experiments but not dismissing them categorically.

The Opt Out Solution

There are at least two components of the possible solution. One is an organized review board, akin to institutional review boards, created across the industry rather than within a single company. Perhaps a list of ongoing experiments should be made available for those who care enough to review it.

More importantly, users of Facebook or other websites should be allowed to opt out of any experiments (again, not just science experiments but also advertising-motivated experiments).

I think many critics are being incredibly facetious by asking for prior consent. For many psychology experiments, we need to use double blinding. The experiment at the center of this controversy would have been neutered if the participants knew what was being changed.

The Issue of Harm

One of the weakest arguments raised against Facebook is the allegation of harm. James Grimmelmann, who has been cited by various articles, wrote definitively: "The study harmed participants." (link) How so? He explains:

The unwitting participants in the Facebook study were told (seemingly by their friends) for a week either that the world was a dark and cheerless place or that it was a saccharine paradise. That’s psychological manipulation, even when it’s carried out automatically.

That's it? And he condones advertisers and politicians who manipulate our emotions because somehow we should accept lower standards in those arenas.

The Fallacy of Paying for Free Service

The WSJ commits the same fallacy as a lot of other journalists when it comes to explaining why users should submit to experimentation. It says:

Companies like Facebook, Google Inc. and Twitter Inc. rely almost solely on data-driven advertising dollars. As a result, the companies collect and store massive amounts of personal information.

The same argument has been used to support massive invasion of privacy and indiscriminate data collection.

There are several problems with this argument:

Firstly, most companies that are collecting massive amounts of data on their users today are not solely advertisers. Amazon is not an advertiser. Netflix is not an advertiser. Cable and phone companies are not advertisers. Banks are not advertisers. Your doctor is not an advertiser. (I mean, not primarily advertisers; some of these actually do earn advertising dollars.)

Secondly, advertising dollars are there whether or not there is data. I have yet to see a proper study that shows that digital marketing dollars are incremental spending, not just repurposed spend shifting from offline channels. Indeed, the frequent claim that digital advertising solves the half-the-traditional-advertising-spend-is-wasted problem is evidence that the digital marketing play is mostly spend shifting, not spend creation. The alternative world in which these digital advertisers and data collectors do not exist is not one in which the advertising market is half the size of what it is today.

Thirdly, advertisers do not need personal information. If advertising is the purpose of the data collection, anonymous data is just as good. This is because brands have an identity. Brands do not want to be different things to different people. Brand messaging, just like other forms of messaging, benefits from simplicity. Nike wants everyone to use "just do it"; Nike is never going to want a million slogans for a million people. Thus, the idea that collecting massive amounts of personal data is a "result" of data-driven advertising is bogus.

Fourthly, the business model of Facebook, Google, etc. is a choice. No one but themselves is forcing them to rely on advertising dollars. Google, for instance, makes Android, Chrome, etc., all of which are products it could charge money for.

The journalists who keep printing the argument that we have to accept mass surveillance in order to support the business model of "free" are merely repeating a marketing message without thinking about it. They have fallen victim to emotional manipulation.

To add to my prior post, having now read the published paper on the effect of DST on heart attacks, I can confirm that I disagree with the way the publicist hired by the journal messaged the research conclusion. And some of the fault lies with the researchers themselves who appear to have encouraged the exaggerated claim.

Here is the summary of the research as written up by the researchers themselves. First I note the following conclusion:

and right before, they write this explanation of the "timing" effect:

So indeed, if I were to believe the research, someone may have a heart attack on Monday instead of Tuesday "as a result of" daylight saving time in the spring. And wait a minute, by reversing this change in the fall, we seemingly postpone some heart attacks by two days. Hence my assertion that, even if true, the phenomenon is not interesting.

In fact, I think this study provides negative evidence toward the idea that DST causes heart attacks. Here is how the authors describe their hypothesis:

The new data show no statistically significant difference in overall heart attack admissions for either period. That is their main result.

***

In this post, I want to discuss the challenges of this type of research. The underlying data is OCCAM (see definition here). It is observational in nature, it has no controls, it is seemingly complete (for "non-federal hospitals in Michigan"), and it is adapted and merged (as explained in the prior post).

Start with the raw data, in which there is a blip observed the Monday after Spring Forward. This problem is one of reverse causation: we see a blip, now we want to explain it.

Spring Forward is put forward as a hypothetical "cause" of this blip. But, we should realize that there is an infinity of alternative causes.

Seasonality is clearly something that needs to be considered. Is it normal to see an increase in admissions from Sunday (weekend) to Monday? To establish how unusual that blip is, we need to manufacture a "control," because none exists in the data.

In the poster presentation, the researchers use a simple control: what happened the week before? (This is known as a pre-post analysis.) The red line shown on the chart would suggest that a jump on Monday is unusual. This chart is a reproduction of the two charts from the poster but superimposed.
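In code, the pre-post comparison is nothing more than a day-by-day difference; here is a sketch with made-up counts (not the study's data):

```python
# Hypothetical daily admission counts for the weeks before and after Spring Forward.
days = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"]
week_before = [28, 31, 30, 29, 32, 30, 27]   # the pre-1-week "control"
week_after  = [27, 41, 31, 30, 31, 29, 28]   # with a Monday blip

for day, before, after in zip(days, week_before, week_after):
    print(f"{day}: {before:>2} -> {after:>2}  (difference {after - before:+d})")
# The control asks only one question: is this Monday's jump unusual relative to
# last Monday? If last week happened to be anomalous, the comparison misleads.
```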

One can complain that the pre-1-week control is too simplistic. What if the week before was anomalous? A natural way forward is to use more weeks of data in the control. In the published paper, the researchers abandon the pre-1-week control, and basically use several years of data to establish a trend.

But this effort is complicated by the substantial variability in the data over time:

(I can't explain why the counts here are so much lower than the counts given in the post-DST week line in the first chart. In the paper, they describe the range of daily counts as 14 to 53.)

So expanding the window of analysis is double-edged. On the one hand, we guard against the one week prior to Spring Forward being an anomaly; on the other hand, we include other weeks of the year that are potentially not representative of the period immediately prior to Spring Forward.

The researchers do not simply average the prior weeks--they actually produce a statistical adjustment on the raw data, and call that the "trend model prediction". This is a very appealing concept. What we really want to know (but can't) is the "counterfactual": the number of cases if there were no DST time change.

In the next chart (reproduced from their paper), the "trend" line is what the authors claim the counterfactual counts would have been. They then compare the red line to the blue line (actual counts) and make claims about excess cases.
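One simple way to manufacture such a counterfactual (this is my sketch on simulated data, not the authors' actual adjustment, which is more elaborate) is to fit a trend-plus-day-of-week model on data excluding the post-DST week, predict that week, and call the gap the "excess" cases:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated daily admission counts with a mild upward trend (invented data).
rng = np.random.default_rng(2)
dates = pd.date_range("2010-01-01", "2013-12-31", freq="D")
df = pd.DataFrame({"date": dates,
                   "t": np.arange(len(dates)),
                   "dow": dates.dayofweek.astype(str)})
df["count"] = rng.poisson(30 + 0.001 * df["t"])

# Fit the trend model on everything except the week after the 2013 time change.
post_dst = (df["date"] >= "2013-03-10") & (df["date"] < "2013-03-17")
model = smf.poisson("count ~ t + C(dow)", data=df[~post_dst]).fit(disp=False)

expected = model.predict(df[post_dst])          # the counterfactual counts
observed = df.loc[post_dst, "count"]
print("Excess cases in the post-DST week:", round((observed - expected).sum(), 1))
# Every claim of "excess" cases rests on how credible this predicted trend is.
```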

***

Of course, the devil is in the details. If you're going to make predictions about the counterfactual, the reader has to gain confidence in the assumptions you use to create those predictions.

One way to understand the statistical adjustment is to plot the raw data and the adjusted data side by side. Unfortunately we don't have the raw data. We do have the one week of pre-DST data from the poster. So I compare that to the "trend".

This chart raises two questions. First, the predicted counts in the paper are about 30% higher than the counts in the pre-DST week from the poster. Second, the pre-week distribution of counts by day matches the "trend" poorly.

While the pre-count is not expected to match the predicted "trend" perfectly, I'd expect that the post-counts should match since both the poster and the paper address what happens the week after the DST time change.

Strangely enough, the counts in the paper are 35% higher than those in the poster for the post-DST week! I'm not sure what to make of this: maybe they have expanded the definition of what counts as "hospital admissions for AMI requiring PCI".

The attempt to establish a control by predicting the counterfactual is a good idea. Given the subjectivity of such adjustments, researchers should be rigorous in explaining the effect of the adjustments. Stating the methodology or the equations involved is not sufficient. The easiest way to explain the adjustments is to visualize the unadjusted versus the adjusted data. The direction and magnitude of the adjustments should make sense.

***

Going back to the problem of reverse causation. Seasonality, trend and DST are only three possible causes for the Monday blip. Analysts must make an effort to rule out all other plausible explanations, such as bad data (e.g. every time the time changes, some people forget to move their clocks).

As I am testing your patience again with the length of this post, I will put my remaining comments in a third post.