April 2019

In the previous post, I described how some researchers found insights from a database of fatal car crashes. This dataset has all the markings of OCCAM data, an acronym I use to summarize the characteristics of today's datasets.

Observational

The data come from reports of crash fatalities, rather than from experiments, surveys, or other data collection methods.

No Controls

The database only contains the cases, i.e. fatalities, but not controls, which in this case should be drivers who did not suffer fatalities. The study design creates a type of control but, as discussed in the previous post, the "controls" are still fatalities; they just happened during different weeks. Such a study design requires the untested assumption that, under normal circumstances, the frequency of fatalities is constant within the three-week window of the study.

Seemingly Complete

It is assumed that all crashes involving fatalities are reported accurately in the database. This assumption is frequently discovered to be wrong when the analyst digs into the data. A recent example is the Tesla Autopilot analysis: even though in theory Tesla should have data on all its vehicles, the spreadsheet contains a large number of missing values.

Adapted

The fatality data are collected for a number of uses, none of which is to investigate the potential effect of 420 Cannabis Day. Adapted data are sometimes called "found data" or "data exhaust."

Merged

For this analysis, the researchers did not merge datasets. Most of the time, they do. For example, one of the commenters suggests looking at the effect of temperature. To do that requires merging local temperature data with the fatality data. Merging data creates all kinds of potential data quality issues.
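To make the merging pitfall concrete, here is a minimal sketch in Python, using entirely hypothetical file and column names, of joining daily local temperatures onto crash records. Any crash whose county or date is absent from the weather file silently picks up missing values, which then propagate into the analysis.

```python
import pandas as pd

# Hypothetical inputs: crash-level records and a daily local temperature file.
crashes = pd.read_csv("fatal_crashes.csv", parse_dates=["crash_date"])
weather = pd.read_csv("county_temperature.csv", parse_dates=["date"])
weather = weather.rename(columns={"date": "crash_date"})

# Left join keeps every crash, even when no matching weather row exists.
merged = crashes.merge(weather, on=["county_fips", "crash_date"], how="left")

# Data-quality check: how many crashes failed to find a temperature reading?
n_missing = merged["temperature_f"].isna().sum()
print(f"{n_missing} of {len(merged)} crashes have no matching temperature")
```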

***

In this post, we shall set aside the conclusion of the previous post, that April 20 may not be extraordinary. For the sake of argument, we accept that April 20 is an unusual day.

The first question to ask is: unusual in what way?

Let's look at the histogram again:

April 20 is unusual in having a higher number of fatal car crashes compared to the average of April 13 and 27.

That is what we learned from the data. Our next question is: why is April 20 worse?

According to the original study, the reason for the excess fatalities is excess cannabis consumption on April 20 because 420 is cannabis celebration day.

But at this point, we only have story time. Story time is the spinning of grand stories based on tiny morsels of data. The moment hits you in the second half of a newspaper article or research report after the author presents the data analyses, when you realize that story-telling has begun, and the report strays far from the evidence.

In this case, it's the link between excess fatalities and excess cannabis consumption that is tenuous. The problem goes back to OCCAM data and the lack of proper controls. If we could perform an experiment, the evidence could be interpreted more directly.

The database of fatalities does not contain data on cannabis consumption. The original study has some info on "Drug police report" with over 60 percent of the cases listed as "not tested or not reported". This information is not used to argue one way or another about cannabis consumption.

The next step for this type of study is finding corroborating evidence to support the causal story. For example, are more of these accidents occurring around neighborhoods in which 420 Day is being celebrated? Can we find neighborhoods that only started celebrating 420 Day after a certain year, and look at whether a jump in crash fatalities occurred after that year? Do people drive more or less frequently after they smoke weed? Are there proxies for cannabis consumption (for example, maybe cannabis users are more likely to drive certain types of cars)? And so on.

Harper and Palayew looked into whether the crash ratio got worse over time because cannabis consumption may have increased over time. They failed to see this, which weakens the conclusion.

When I ask job-seekers what their biggest obstacle is to finding a job in data science and analytics, one of the most frequent answers is performing during the interview. Some of them are stumped by technical interviews (coding) while even more are worried about the case interviews.

The purpose of the case interview is to test critical thinking. It is as challenging for the job candidate as for the hiring manager! Technical questions have pretty standard answers, and it's easy to score the answers. Case interviews are like essays - the hiring manager has to make judgment calls.

My piece on critical thinking is featured at the KDnuggets blog, which I've followed since I was an analyst. In this first part, I explain the two aspects of critical thinking that the case interviewer is typically looking for. There will be a part 2 in which I provide some practice examples.

And if you're wondering about the acronym, it stands for Driving Under the Influence of Weed on 420 Day which, as I learned from Andrew Gelman's blog, is a day of celebration of cannabis.

Andrew's blog post is about the exemplary work done by Sam Harper and Adam Palayew, debunking a highly-publicized JAMA study that claimed that 420 Day is responsible for a 12 percent increase in fatal car crashes.

The discussion provides great fodder for examining how to investigate observational data, which is what most of Big Data is about. It is a cautionary tale for what not to do.

***

The blog begins with Harper/Palayew channeling Staples/Redelmeier, the authors of the study: "fatal motor vehicle crashes increase by 12% after 4:20 pm on April 20th (an annual cannabis celebration)."

This short sentence captures the gist of the original study but it omits an important detail: to what is the increase relative?

If we ran an experiment, we would recruit a group of drivers, and select half of them at random to smoke weed on April 20. Then, we would count the proportion of drivers in each group who suffered fatal car crashes after 4:20 pm. The analysis would be straightforward: what's the difference in proportions between the two groups? With such an experiment, it is possible to draw a causal conclusion.
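As a rough sketch of that analysis, with entirely made-up counts since no such experiment exists, the comparison boils down to a difference between two proportions and its standard error:

```python
import math

# Hypothetical counts from the imagined experiment.
n_treated, crashes_treated = 5000, 9      # drivers assigned to smoke weed
n_control, crashes_control = 5000, 6      # drivers assigned not to

p1 = crashes_treated / n_treated
p0 = crashes_control / n_control
diff = p1 - p0

# Standard error of the difference between two independent proportions.
se = math.sqrt(p1 * (1 - p1) / n_treated + p0 * (1 - p0) / n_control)

print(f"difference in proportions: {diff:.4f} (+/- {1.96 * se:.4f})")
```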

Alternatively, we could conduct a case-control study. The cases are the drivers who suffered fatal car crashes on April 20. We collect demographic data on these drivers. Then, we define a set of "controls": drivers who did not suffer car crashes on April 20 but who, on average, have the same demographic characteristics as the cases. Next, we need data on cannabis consumption, preferably on April 20. We want to show that the level of cannabis consumption is significantly higher for cases than for controls.

(For further discussion of these analysis designs, see Chapter 2 of Numbers Rule Your World (link).)

The actual study was neither an experiment nor a case-control study. It was a piece of pure data analysis, based on "found data." I like to call this "adapted data," the "A" in my OCCAM framework for Big Data: data collected for other purposes that the researcher has adapted for his/her own objectives. In this study, the adapted data come from a database of fatal car crashes.

So how was the adapted data analyzed? Harper/Palayew answer this question in their second description of the research:

Over 25 years from 1992-2016, excess cannabis consumption after 4:20 pm on 4/20 increased fatal traffic crashes by 12% relative to fatal crashes that occurred one week before and one week after.

The cases are the fatal car crashes that occurred after 4:20 pm on 420 Day. The comparison isn't to drivers who did not suffer crashes on the same day. The reference group consists of fatal car crashes that occurred after 4:20 pm on 4/13 and 4/27. The difference between the 4/20 count and the average count on those two reference days is taken to result from "excess cannabis consumption."

Notice that such a conclusion requires a strong assumption. We must believe that absent 420 Day, 4/13, 4/20 and 4/27 ought to have the same fatal crash frequencies.
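In code, the study's comparison amounts to a single ratio. The sketch below uses made-up crash counts (not the actual study numbers) to show the arithmetic and where the assumption enters:

```python
# Hypothetical counts of fatal crashes after 4:20 pm, summed over 1992-2016.
crashes_apr20 = 1369
crashes_apr13 = 1225
crashes_apr27 = 1215

# The reference level assumes 4/13 and 4/27 are what 4/20 "would have been".
reference = (crashes_apr13 + crashes_apr27) / 2
ratio = crashes_apr20 / reference

print(f"crash ratio on 4/20: {ratio:.3f}")   # ~1.12, i.e. a claimed 12% excess
```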

***

You hopefully recognize that the analysis design for adapted data is on much shakier ground than either an experiment or a case-control study.

Harper/Palayew's initial debunking focused on one issue: what's so special about April 20? To answer that, they repeated the same analysis on every day of the year. The following pretty chart summarizes their finding:

The red line is the line of no difference (between the analyzed day and the two reference days from the week before/after). Each vertical line shows the range of the estimated difference for a specific day of the year. The range for 4/20 is highlighted, and several other days with elevated fatal crash counts are labeled.

The chart was originally published here, with the following commentary: "There is quite a lot of noise in these daily crash rate ratios, and few that appear reliably above or below the rates +/- one week." Andrew adds: "Nothing so exciting is happening on 20 Apr, which makes sense given that total accident rates are affected by so many things, with cannabis consumption being a very small part."

While the chart looks cool and sophisticated, the following histogram of the same data helps the reader digest the information.

I took the daily estimates of the fatal crash ratios from Harper/Palayew's published data. Each ratio compares the crashes on the analysis day to the crashes on the two reference days. The histogram shows the day-to-day variability of the crash ratios, which is what we need to answer the question: how special is 4/20?

The histogram is roughly centered at 1.0, meaning no observed difference. The black vertical line shows the ratio for 4/20. It leans right; in fact, it sits at the 94th percentile. In classical terms, that is a p-value of 0.06, just missing conventional significance.
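Here is a rough sketch of how that percentile could be computed from the daily ratios; the file name and column names are placeholders, not Harper/Palayew's actual files:

```python
import pandas as pd

# Placeholder file: one estimated crash ratio per calendar day (analysis day vs. reference days).
ratios = pd.read_csv("daily_crash_ratios.csv")   # assumed columns: day, ratio
apr20 = ratios.loc[ratios["day"] == "04-20", "ratio"].iloc[0]

# Where does 4/20 fall in the day-to-day distribution of ratios?
percentile = (ratios["ratio"] < apr20).mean() * 100
print(f"4/20 sits at roughly the {percentile:.0f}th percentile")

# The fraction of days at least as extreme as 4/20 plays the role of a one-sided p-value.
p_value = (ratios["ratio"] >= apr20).mean()
print(f"approximate p-value: {p_value:.2f}")
```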

Will JAMA editors accept one research paper for each of these days? The work is already done - the rest is story time.

P.S. [4/27/2019] Replaced the first chart with a newer version from Harper's site. This version contains the point estimates that the other version did not. Those point estimates are used to generate the histogram.

If you are a frequent flier, you already know the gist of this nice article by the BBC: airlines are allowed to sandbag flight durations. A flight that takes 60 minutes in the air may be advertised to fliers as taking twice as long, if not longer.

The airlines are even allowed to lie about this practice. When your flight is delayed taking off, the captain claims that s/he will “make up for the delay,” as if the plane could be driven faster on command. (Were they deliberately going slower before?) The truth is that the schedule is padded, so that it can absorb a limited amount of delay.

This quote sums the situation up: “By padding, airlines are gaming the system to fool you.”

At the very bottom of the article, you’d find the potential motivation – to avoid compensating travelers for long delays, as required by law in some countries.

***

The situation here is similar to the road congestion problem discussed in Chapter 1 of Numbers Rule Your World (link). Managing perceived time is as important as managing actual time experienced by the traveler. Of course, reducing actual wasted time is preferred, especially to scientists working on the problem, but when the road/sky capacity is fixed and over-subscribed, it’s almost impossible to attain. The second half of the article addresses “why don’t the airlines work on efficiency instead of lengthening flight times?”

***

Another quote is revealing: “Over 30% of all flights arrive more than 15 minutes late every day despite padding.”

The “on-time arrival rate” blessed by the Department of Transportation (DoT) is not what you think it is.

Let’s take a random flight that takes 60 minutes. This flight schedule might be padded and advertised as departing at 2 pm and arriving at 4 pm.

If the flight departs at 2 pm and takes 60 minutes, then you’d think on-time arrival is defined as arriving at 3 pm. You might agree to allow for some slack, say, 15 minutes. In this case, on-time arrival is arriving before 3:15 pm.

Given the discussion, you now know that on-time arrival is actually arriving before 4 pm since the schedule is padded not by 15 minutes but by 60 minutes.

And then you’d be wrong! Because there is padding upon padding. Airlines are allowed to claim “on-time arrival” if the flights arrive within 15 minutes of the scheduled arrival time, which in our example, has been padded by 60 minutes. So any flight arriving before 4:15 pm is counted as “on-time.”
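Here is a tiny sketch of the double padding, using the numbers from this example:

```python
from datetime import datetime, timedelta

actual_flight_time = timedelta(minutes=60)
scheduled_departure = datetime(2019, 4, 1, 14, 0)   # 2 pm
scheduled_arrival = datetime(2019, 4, 1, 16, 0)     # 4 pm, i.e. 60 minutes of padding

# What a naive flier might consider "on time": actual duration plus 15 minutes of slack.
naive_cutoff = scheduled_departure + actual_flight_time + timedelta(minutes=15)

# The DoT definition: within 15 minutes of the (already padded) scheduled arrival.
dot_cutoff = scheduled_arrival + timedelta(minutes=15)

print(naive_cutoff.strftime("%I:%M %p"))  # 03:15 PM
print(dot_cutoff.strftime("%I:%M %p"))    # 04:15 PM
```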

***

Padding is not purely a bad thing. A certain amount of padding is necessary because lots of flights are vying for a limited amount of airport and air space. A padded schedule is a more accurate schedule. It acknowledges other factors that cause delays in arrival.

The gaming of the padded metric is what gets people worked up. Gaming is possible because padding inserts subjectivity into the measurement. So long as subjectivity cannot be avoided, gaming is here to stay.

***

The reporter said airlines have spent billions on technologies to improve efficiency, i.e., managing actual experienced time. But, “this has not moved the needle on delays, which are stubbornly stuck at 30%.”

Now square that statement with this one: “Billions of dollars in investment [in modernising air traffic control] have in fact halved air traffic control-caused delays since 2007 while airline-caused delays have soared.”

Does this sound like the new technologies have successfully reclassified delays from air-traffic-control-caused to airline-caused? Passing the buck?

The next time your flight is delayed, the airline will likely tell you, “it’s not us, it’s the weather.”

Just finished reading The Undoing Project by Michael Lewis, his biography of the Kahneman and Tversky duo, who made many of the seminal discoveries in behavioral economics.

In Chapter 7, Lewis recounts one of their most celebrated experiments which demonstrated the “base rate fallacy.”

Here is one version of the experiment. The test subjects are asked to make judgments based on a vignette.

Psychologists have administered tests to 100 people, 70 of whom are lawyers and 30 of whom are engineers.

(A) If one person is selected at random from this group, what is the chance that the selected person is a lawyer?

(B) Dick is selected at random from this group. Here is a description of him: “Dick is a 30 year old man. He is married with no children. A man of high ability and high motivation, he promises to be quite successful in his field. He is well liked by his colleagues.” What is the chance that Dick is a lawyer?

Those subjects who answered (A) made the right judgment, in accordance with the base rate of 70 percent.

The answer to (B) should be the same, since it shouldn't matter whether the randomly selected person is named Dick or not, and the generic description provides no useful information for determining Dick's occupation. However, those subjects who answered (B) revised the chance down to about 50-50. The experiment showed that access to Dick's description led people astray, causing them to ignore the base rate. Note that the base rate here is the prior probability.
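To see why the answer to (B) should stay at 70 percent, here is the Bayes' rule arithmetic. The key assumption, made explicit in the code, is that the generic description is equally likely to fit a lawyer or an engineer:

```python
# Base rates (prior probabilities) from the vignette.
p_lawyer = 0.70
p_engineer = 0.30

# Assumption: the generic description is uninformative, i.e. equally likely
# to be written about a lawyer or an engineer.
p_desc_given_lawyer = 0.5
p_desc_given_engineer = 0.5

# Bayes' rule: posterior probability that Dick is a lawyer.
numerator = p_desc_given_lawyer * p_lawyer
denominator = numerator + p_desc_given_engineer * p_engineer
posterior = numerator / denominator

print(f"P(lawyer | description) = {posterior:.2f}")   # 0.70, same as the base rate
```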

***

What are the practical applications of the KT experiment for business data analysts?

tl;dr

Before throwing the kitchen sink of variables (features) into your statistical (machine learning) models, review the literature on the base rate fallacy, starting with the Kahneman-Tversky experiments.

1. Adding more variables can make your predictions worse

Let's start with what kind of additional information is provided by Dick’s description. The sample size has not changed – it’s still one. The data expanded only in the number of variables (or features). Specifically, these eight additional variables:

X1 = age

X2 = gender

X3 = marital status

X4 = number of children

X5 = ability level

X6 = motivation level

X7 = expected level of success in field

X8 = popularity among colleagues

In today’s age of surveillance data, it is all too easy for any analyst to assemble more variables. The KT experiment shows that having more variables does not imply having more useful information. Worse, those extra variables may distract you from the base rate, leading to worse predictions.

2. Machines are even more susceptible than humans

If humans are prone to such mistakes, should we use machines instead? Sadly, machines will perform worse.

Machines allow us to process even more variables at even greater efficiency. Instead of eight useless variables, you can now add 800 or even 8,000 useless variables about Dick. The machines will then inform you which subset of these variables “pop.” The more useless data you add in, the higher the chance you will encounter an accidental correlation.
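A small simulation makes this concrete: with enough useless variables, some of them will correlate with the outcome purely by accident. Everything below is simulated noise; nothing actually predicts anything.

```python
import numpy as np

rng = np.random.default_rng(420)
n_obs = 100                      # number of observations
outcome = rng.normal(size=n_obs) # the thing we pretend to predict

for n_vars in (8, 800, 8000):
    # Generate useless variables: pure noise, unrelated to the outcome.
    noise = rng.normal(size=(n_obs, n_vars))
    # Correlation of each noise variable with the outcome.
    corrs = np.array([np.corrcoef(noise[:, j], outcome)[0, 1] for j in range(n_vars)])
    # The maximum accidental correlation grows with the number of useless variables.
    print(f"{n_vars:>5} useless variables: max |correlation| = {np.abs(corrs).max():.2f}")
```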