Posts categorized "Junk Charts"

On March 23, 2018, just after 9:30 am, a Tesla SUV plowed straight into a highway divider at the point where the exit ramp to Route 85 splits from Route 101, in Mountain View, California. The impact caused the battery to burst into flames, and the driver was trapped in his seat by the seat belt. He died in the hospital.

Last week, the National Transportation Safety Board (NTSB) issued a preliminary report providing further details. Here are the key findings:

The "AutoPilot" feature was switched on at the time of the crash. It was on continuously for the last 19 minutes of the drive (out of 32 minutes total).

The driver's hands were not on the wheel in the last 6 seconds before the crash. In the last 60 seconds, hands were detected on three separate occasions, totaling 34 seconds.

In the final 3 seconds, still with no hands on the wheel, the car accelerated from 62 mph to 71 mph, with "no pre-crash braking or evasive steering movement detected."

The autopilot system was set to a cruise speed of 75 mph, which is above the speed limit of 65 mph.

During the 19-minute stretch before the crash, the Autopilot system issued three warnings to put hands on the wheel; the last came more than 15 minutes prior to the crash.

The car, carrying only the driver, was in the HOV lane, whose restrictions would have been in effect at that time of the morning.

These are all telling observations. What did the reporter covering the Tesla beat at the Wall Street Journal think was the most salient finding? Well - the title of the article says it all: "Tesla Autopilot System Warned Driver to Put Hands on Wheel, U.S. Investigators Say." (link)

It is hard to imagine how that particular finding is the most important... unless you work for Tesla, or someone who wants to pin the blame on the driver who could not defend himself.

Problem #1

The Autopilot was continuously turned on for almost 20 minutes prior to the crash. The last warning to the driver came "more than 15 minutes" before the crash. At 60 mph and above, 15 minutes translates to 15 miles or more from the crash site. For example, the last warning might have been issued while passing Palo Alto, with the crash happening 15 minutes later in Mountain View. Rather than supporting the view that the driver recklessly ignored repeated warnings, this finding raises the question of why the car did not detect the imminent collision and take evasive action.

Problem #2

Hands not on wheel is taken to be a finding of major importance, but here is the problem: every accident investigation will discover hands not on the wheel. Why? Because our sample consists only of car-casses (pun intended).

If the driver had taken evasive steering action, it would have compensated for the computer error, and the accident would have been avoided!

This is a reverse of the classical Wald paradox that I covered some years ago. In that example, statistician Abraham Wald warned that you couldn't inspect the damage on warplanes that returned home to determine which parts of the planes were most subject to damage - because your sample is missing the planes which were shot down!

In our example, we get to see the dead but not the survivors. In order to understand whether the computer made errors, we'd need to also include cases in which the driver prevented the Autopilot from crashing the car.
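To make the selection effect concrete, here's a toy simulation (in Python, with made-up rates; the real-world numbers are unknown to me):

```python
# A toy simulation of the selection effect, with made-up rates: even when the
# computer errs at the same rate regardless of hand position, every crash we
# observe has hands off the wheel, because hands-on drivers correct the error.
import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000
hands_on = rng.random(n) < 0.9          # assume hands on wheel 90% of the time
computer_error = rng.random(n) < 0.001  # error rate independent of hands

# A crash occurs only when the computer errs AND no human corrects it.
crash = computer_error & ~hands_on

print("overall computer error rate:", computer_error.mean())
print("share of observed crashes with hands off wheel:",
      (~hands_on)[crash].mean())  # 1.0 by construction - survivors never enter the sample
```

Even though hands-off driving accounts for only 10 percent of the time in this sketch, it accounts for 100 percent of the crashes in our sample.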

Problem #3

Given that the driver did not steer in the final seconds, why did the Tesla accelerate? Why was it allowed to exceed the speed limit? These questions point to unresolved technical challenges.

One statistical concept that instructors frequently don't have time to cover in Stat 101 is the "interaction" effect. I will explain this concept using the fantastic interactive graphic by the visualization team at the German publication Zeit (please also read the corresponding post on Junk Charts here for some background).

When we ignore interactions, we end up with overly simplistic statistical summaries. For example, some study might find that drinking six cups of coffee reduces the chance of prostate cancer by 60 percent. But is this effect the same for all age groups? Is it possible that the risk reduction is higher for older men and lower for younger men, for example? When we are asking questions like these, we are asking whether the effect of coffee consumption on health interacts with age.

In the Zeit visualization, the following graph illustrates that Germans living in the former East Germany (blue line, referred to as East Germans hereafter) are more likely to hold negative views about smoking cannabis than Germans living in West Germany (yellow; West Germans). On average, about 70 to 80 percent of the people deemed the activity "bad" or "very bad", and the East-West gap was roughly 8 percentage points.

A relevant question here is whether the size of this East-West gap varies by age.

In the interactive visualization, one can click on different age groups to observe how the lines shift. The left chart below shows the East-West gap for 45-64 year olds while the right shows the gap for people aged 65 and above.

The strongest signal to hit us is probably that the 65-and-above cohort has a much more negative opinion of smoking cannabis than the younger cohort. This is a statement about the effect of age on attitudes towards cannabis, regardless of where respondents live. It concerns a single factor, and so it isn't an interaction effect.

The next observation is that in the 65-and-above cohort, the East-West gap is noticeably smaller than the average gap, and it has remained essentially unchanged over the last 12 years. For the 45 to 64 age group (left chart), however, the gap has markedly increased: by 2012, East Germans were about 15 percentage points more likely to disapprove of smoking cannabis than West Germans, even though both groups started out with about the same attitude in 2000.

Statisticians call this a significant interaction effect between East-West and age group. When an interaction effect exists, the aggregate statistic is not very useful because it fails to reflect the variation across age groups. While the average gap was 8 percentage points, the gap for people 65 and above was only about half of that, and the gap for the 45 to 64 cohort was almost twice that. (I am ignoring the other age groups; just keep clicking.)

It is important to distinguish between three effects: the "main" effect of East Germany vs West Germany, the "main" effect of age group, and the interaction effect between East/West and age group.

When you ignore interaction effects, you are assuming "additivity". Unfortunately, when it comes to statistics, one plus one usually does not add up to two! This point causes much confusion for non-statisticians. In statistics, "one" is not an exact quantity; there is a margin of error around it.

***

As a second example, consider the following graph which shows the East-West gap on the issue of whether working mothers are good.

The two bolded lines represent the average person in East and West Germany. We see that the East-West gap has narrowed marginally, from 28 to 25 percentage points, over 25 years.

Does this gap vary depending on whether the respondent is male or female? To see this, we split the male and female responses. Below, the males are shown in gray on the left chart and the females are in gray on the right side.

The first thing to notice is that the two lines for men (East and West) both sit below average, meaning that men are less likely to accept working mothers. Not surprisingly, the lines for women (on the right chart) show that they are more likely to accept working mothers. But these observations concern the effect of gender on the working-mothers issue. What we are interested in is whether the East-West difference is affected by gender.

In other words, we care about the gaps between the gray lines on each chart. With only a little effort, you can see that the gap is wider on the left chart than on the right: when it comes to working mothers, the gap between East and West German men is larger than the gap between East and West German women. Statisticians call this an interaction effect. When such an effect is significant, statisticians prefer to talk about the genders separately rather than combining them into one average.

***

Next time you run a regression, add some interactions and see if it makes a difference. I address this issue in Chapter 3 of Numbers Rule Your World.
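To see what this looks like in practice, here is a minimal sketch in Python (using statsmodels), with simulated numbers loosely inspired by the cannabis example - nothing here comes from the actual Zeit data:

```python
# A minimal sketch, with simulated data, of fitting a regression with and
# without an interaction term. All numbers are made up for illustration.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 2000
region = rng.choice(["east", "west"], size=n)
age = rng.choice(["45-64", "65+"], size=n)

# Build in an interaction: the East-West gap is wider for the younger cohort.
gap = np.where(age == "45-64", 15.0, 4.0)
disapproval = (70 + 8 * (age == "65+") + gap * (region == "east")
               + rng.normal(0, 5, size=n))

df = pd.DataFrame({"disapproval": disapproval, "region": region, "age": age})

additive = smf.ols("disapproval ~ region + age", data=df).fit()
interacted = smf.ols("disapproval ~ region * age", data=df).fit()

print(additive.params)    # forces a single, averaged East-West gap
print(interacted.params)  # lets the gap differ by age group
```

The additive model reports one averaged gap; the interacted model recovers the two different gaps we built into the data.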

A lot of Big Data analyses default to analyzing count data, e.g. the number of searches for certain keywords, the number of page views, the number of clicks, the number of complaints, etc. Doing so throws away much useful information and frequently leads to bad analyses.

***

I was reminded of the limitation of count data when writing about the following chart, which I praised on my sister blog as a good example of infographics, a genre chock-full of deplorable things.

On the other blog, I explained why I prefer to hide the actual numbers, from a dataviz perspective.

There is also a statistical reason for not drawing undue attention to the counts.

These counts do not indicate the severity of the injuries: some may have knocked a player out of the game; others may have been much milder. Moreover, some injuries are sustained by first-team players, who spend much longer on the field than backups, which inflates their injury counts even if their rate of injury per minute played is no higher.
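Here is a sketch, with entirely hypothetical numbers, of how an exposure adjustment can flip the story told by raw counts:

```python
# A sketch with made-up numbers: adjusting injury counts for playing time
# (exposure) can reverse the story told by the raw counts.
import pandas as pd

df = pd.DataFrame({
    "player_type": ["first-team", "backup"],
    "injuries": [30, 12],             # raw counts: first-teamers look worse
    "minutes_played": [45000, 9000],  # but they are on the field far longer
})

# Injuries per 1,000 minutes on the field
df["injury_rate"] = df["injuries"] / df["minutes_played"] * 1000
print(df)  # rates: first-team 0.67 vs backup 1.33 - the ranking flips
```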

Another statistical consideration is heterogeneity. I'd like to see a small-multiples version of this chart, with the data split by position on the field. I suspect it would be quite telling to see which body parts get hurt more depending on one's role in the game. Similarly, splitting by age, body size, and other factors would yield interesting insights.

***

At about the same time, I was reading the July issue of Significance magazine (an RSS and ASA publication). Here is the link (not free).

In an article assessing whether iceberg risk was unusually high in the year of the Titanic sinking, the authors quantified the risk as the "number of icebergs crossing latitude 48 N each year". It would seem worthwhile to ask whether the size distribution of those icebergs is also relevant.

Then, in an article about "black box modeling" (i.e. data mining) by Max Kuhn and Kjell Johnson, they invoke the FDA adverse event reporting database as an example of "events data". Events data is everywhere these days, and the most popular analyses of such data revolve around counting the number of adverse events. The severity and type of the events are frequently ignored.

P.S. In their otherwise gung-ho article, Kuhn and Johnson also point to one of the biggest challenges of OCCAM data: "If there is a systematic bias in a small data set, there will be a systematic bias in a larger data set, if the source is the same." If one is analyzing the FDA adverse events database, one hopes to apply the learning to people who don't yet have adverse events, but such an analysis would be flawed since the database doesn't contain any controls, i.e. people without these adverse reactions.

Thanks to the 200 or so people who showed up at last week's Data Scientist Meetup in Cambridge, Mass., hosted by John Baker. I gave a brief introduction to the concept of "numbersense", and was part of a panel of "chief data scientists" talking about how to run data teams. Thanks to those who asked questions.

This month, I am back in New York, and will be giving two talks.

First up is the Data Visualization New York Meetup organized by Paul Trowbridge. The link to register is here, but it looks like all slots were taken within days. Get on the wait list, as some registrants will eventually drop out. This event is on Aug 20 (Wed).

On Aug 26 (Tues), I am giving the "thought leader" presentation at the Optimizely Experience. I will be talking about statistical testing for online marketing, a.k.a. A/B testing. The title of the talk is "Five Questions About Testing You Wanted to Ask But Didn't", unless I come up with something better. You can register here.

This will be a brand-new presentation, and I look forward to sharing my ten-plus years of running online experiments. See you there!

***

Also, please let the organizers at SXSW know you want to hear me and other data viz experts talk about visualizing data in Austin. Jon Schwabish has put together a fabulous panel with people from different parts of the spectrum, and it promises to be an engaging conversation.

In case you are not subscribed to my dataviz feed, I put up a post yesterday that is highly relevant to readers here interested in statistical topics. The post discusses a graphic from a New York Times article interpreting the official inflation rate (known as the CPI). I devoted an entire chapter of Numbersense (link) to the question of why the official inflation rate diverges from our everyday experience.

In a larger context, the inflation rate is an invented metric, designed to measure a quantity that has no objective reality. This is true of a lot of statistics. Revenues and profits, for example, are also invented concepts, and only attain meaning through generally accepted accounting rules. Obesity, which is discussed in Chapter 2 of Numbersense (link), is another example of a quantity that has meaning only because of a measurement convention.

The NYT article brings up one of the points I raised in the book: price increases are magnified in our imagination while price decreases are taken for granted.

The other larger point of the chapter on inflation is that anyone wishing to comment on whether the CPI reflects real experience ought to understand how the CPI is constructed. A superficial understanding, such as "it is the average price of a basket of goods", is useless because so many little details affect the statistic. Because inflation has no objective basis, it is pointless to argue about whether it reflects reality: all we are left with is discussing the rules, and you can't discuss the rules without knowing them well.

Details matter a lot in statistics. This is one of the reasons why I keep asking my Big Data colleagues to talk specifics. A statistician who talks only in generalities is like a Manhattan realtor who can't tell you the size of the listed apartment.

For those who weren't able to attend my recent talks, a few recordings have surfaced online.

***

JMP put up the video of the webcast from last Friday with Alberto Cairo, a data visualization expert and author of The Functional Art. You can access it from here. This event is part of their Analytically Speaking series with recent guests such as David Hand and Michael Schrage. I also appear on this recording of the panel celebrating the International Year of Statistics.

***

Agilone, an emerging vendor of self-service marketing analytics software, hosted me at their recent user conference, as well as a webcast. Here is a clip, in which I explain the structure of analytics teams that I have assembled.

***

Last year, I gave a fun, lightning talk at the Leaders in Software & Art conference. The recording is here.

***

Joe Dager did several long interviews with me that are well worth listening to. Here's Part 1, and here's Part 2.

Long before I came up with "numbersense," I wrote about "true lies" in data analysis. (link)

The nature of data, especially Big (as in multidimensional) Data, is that one can come up with an infinite number of statistical computations, all of which are "true" in the sense that one would obtain these statistics by plugging the data into textbook formulas. Inevitably, some of these statistics contradict one another.

An example I give in the Prologue of Numbersense (link) is a case of Simpson's Paradox. There are two ways to compare two airlines' rates of delay during a given window of time at a common set of arrival airports. One can aggregate the number of flights across all airports, then compare the average rates of delay. Alternatively, one can compute a pair of delay rates for each airport, then compare the rates airport by airport. In the example given in the book, airline A came out ahead on the aggregate measure but was the more delayed airline at each individual airport. This is an instance of "true lies". Airline A is either better or worse, not both, so one of the two methods must lead to the wrong interpretation. Yet one cannot complain that there is anything wrong with the data or with either formula used to compute the averages.
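The following sketch uses made-up numbers (the figures in the book differ), chosen only to reproduce the structure of the paradox:

```python
# A hypothetical instance of Simpson's Paradox: airline A is more delayed at
# each airport yet better in aggregate, because most of its flights go to the
# airport where delays are rare. All numbers are invented.
import pandas as pd

df = pd.DataFrame({
    "airline": ["A", "A", "B", "B"],
    "airport": ["X", "Y", "X", "Y"],
    "flights": [50, 950, 900, 100],
    "delayed": [30, 95, 450, 5],
})

# Per-airport delay rates: airline A is more delayed at BOTH airports.
df["rate"] = df["delayed"] / df["flights"]
print(df.pivot(index="airport", columns="airline", values="rate"))
# airline     A     B
# X        0.60  0.50
# Y        0.10  0.05

# Aggregate delay rates: yet airline A comes out ahead overall.
agg = df.groupby("airline")[["delayed", "flights"]].sum()
print(agg["delayed"] / agg["flights"])  # A: 0.125, B: 0.455
```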

***

I was thinking about "true lies" while reading the exchange between Alberto Cairo and Andy Kirk about the following chart, which prints a "truth": that half of US economic activity occurs in major urban areas constituting a tiny proportion of US territory.

Cairo complains that the chart is silly: since about half the US population lives in those orange urban areas, anyone who accepts the meme that this map offers "incredible" insight is really just surprised that half the US population lives in major urban areas.

Kirk responds: "I get that GDP is essentially a proxy indicator for where people are living yet I still have a novel interest in learning about the dynamics of the US. I *know* that there is not a uniform distribution of where people live (nowhere on earth has this) but it is still revealing for me to see anything that represents a proxy of this skewed population. **I don’t think the map claims to be doing anything different to this so, in that sense, it doesn’t mislead or make false claims.**"

I will be writing on my other blog about the educational aspect of a chart like this, which is the other prong of Kirk's argument. That last sentence, which I bolded, strikes me as arguing that the true lie is true and therefore beyond reproach. This is a crucial difference between doing statistics and doing pure math. In statistics, you can't win arguments by invoking the truth... if the truth were knowable, statisticians would all be unemployed.

The map does not make false claims, but it leads readers to conclude that the orange areas are much more important than the blue region (equal economic activity, but much smaller area). The first problem is that the types of economic activity are vastly different between those regions, and this significant factor is ignored.

The second problem is that the designer over-aggregated the data. All counties (or zip codes) are classified into two groups ("split in half") when in fact the level of economic activity across counties (or zip codes) is a gradient. Imagine plotting the economic activity index by county, ordered from highest to lowest. Do we see a dramatic drop-off once we have counted out half the activity (i.e., the pattern shown on the left chart below)? Or are we more likely to see the pattern shown on the right? And if the distribution looks like the one on the right, would you summarize it with just two segments?
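Here is that thought experiment as a sketch, using a heavy-tailed lognormal to stand in for county-level activity (an assumption on my part, not the actual distribution):

```python
# A sketch with simulated data: sort "county" activity from highest to lowest
# and ask how many counties it takes to reach half of the total.
import numpy as np

rng = np.random.default_rng(1)
gdp = np.sort(rng.lognormal(mean=0.0, sigma=2.0, size=3000))[::-1]

# How many of the top counties does it take to reach half of all activity?
cum_share = np.cumsum(gdp) / gdp.sum()
n_half = np.searchsorted(cum_share, 0.5) + 1
print(f"top {n_half} of {gdp.size} counties produce half the activity")

# Plotting gdp (already sorted in descending order) shows whether there is
# a dramatic drop-off at n_half or simply a smooth, continuous decline.
```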

***

Cairo's general point is that good data visualizations require good data analyses. In turn, good data analysis requires numbersense.

***

Chapter 3 of Numbers Rule Your World (link) explores the question of aggregating data, which is central to statistical thinking. Aggregation features throughout Numbersense (link), particularly in Chapter 1 (school rankings), the chapters on economic statistics, and the chapter on fantasy football.

Also, you can learn statistical concepts from me at NYU. New course starting first week of March. More information here.

One oft-repeated "self-evident" tenet of Big Data is that data end all debate. Except that if you have ever worked at a real company (excluding those ruled by autocrats) and put data on the table, you know that data do not end anything.

Reader Ben M. pointed me to this blog post by Benedict Evans, featuring a confusing chart that shows how Apple has "passed" Microsoft. Evans was a stock analyst before moving to Andreessen Horowitz, a venture capital (VC) firm. He has over 25,000 followers on Twitter.

I'll get to this chart later, but feel free to tour the comments area. You will get a feel for the types of conversations that happen when an analyst offers data at a corporate meeting. It turns out that everyone has an opinion about everything, and while data people think the only way to make an argument is by presenting data, plenty of people from other backgrounds use other modes of persuasion: rhetoric, anecdote, philosophy, and HiPPO (the HIghest Paid Person's Opinion).

The original post plus these comments present a mishmash of metrics that can be used to measure the gap between Microsoft and Apple. Here is the list roughly in chronological order up to the point where the post had 70 comments:

From the original chart, we have: PC shipments; iPhone and iPod shipments; shipments of all devices running MacOS or iOS (possibly excluding AppleTV); PC shipments against all devices running iOS; and PC plus smartphone shipments (excluding tablets and music players) against all devices running iOS.

Then, in the comments, people bring up: Xboxes (a gaming device running a Windows OS, which is almost a PC), Android devices (part of the Google universe), profits instead of units, servers, Google, Macs running Windows, whether it is accurate to say "Apple passes Microsoft" using the blogger's own metric, whether one needs to wait more than one quarter to reach such a conclusion, TVs, printers, the differences in technology residing inside the same device (say, the PC), whether the fourth-quarter result can be generalized, embedded devices, self-built PCs, video editors on the Web, and Amazon.

***

Evans responded to many of these comments by complaining that readers are not getting his message. That's an accurate statement, and it has everything to do with the looseness of his data. This reminds me of Gelman's statistical parable: the blogger is not so much interested in how strong his evidence is as in evangelizing the moral behind the story.

His primary thesis is quite likely correct. There is a huge trend of U.S. consumers spending more time on mobile devices, mostly smartphones and also some tablets. I see this at work, and the trend is widely recognized in the tech world by this time.

But the readers are also right to point out the deficiency of that chart.

***

Using my Junk Charts Trifecta Checkup, we'd say the chart fails at the Data corner. The chart addresses an interesting Question, and the Graphical elements are acceptable, but the Data just do not do justice to his credible thesis.

One reader, for example, supported Evans, blaming other readers for an outdated way of thinking about the computer industry.

That's fair, except that Evans is also guilty of old-school thinking when he uses unit sales of devices as the metric to compare Apple and Microsoft. What he really needs is data on time spent, which can be obtained from Nielsen or Gallup or some similar source. Just count the number of hours the average person spends on iOS devices versus Windows devices - regardless of whether they are phones, PCs, laptops, gaming consoles, music players, or embedded devices. This is sometimes called a "mindshare" metric.

To get even more sophisticated, we should merge mindshare with revenues (or profits) generated. Different companies have different business models. Those that give away services (or hardware) for free, or close to free, can gain mindshare at the expense of revenues (or profits). I like a metric such as dollars generated per hour of use.
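As a toy illustration - every figure below is invented:

```python
# A toy calculation of the dollars-per-hour metric; all figures are
# hypothetical, made up purely for illustration.
import pandas as pd

df = pd.DataFrame({
    "platform": ["iOS", "Windows"],
    "revenue_bn": [150.0, 70.0],   # hypothetical annual revenue, $bn
    "hours_bn": [400.0, 900.0],    # hypothetical total hours of use, bn
})
df["dollars_per_hour"] = df["revenue_bn"] / df["hours_bn"]
print(df)  # more mindshare does not necessarily mean more dollars per hour
```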

***

Back to my original point: doing the data analysis is only the first step. The bigger challenge is getting people with preconceptions to believe the analysis (with all its imperfections and assumptions) and to change their minds.

First, I saw Andrew Gelman's rant about "big bad education" (link), which led me to Mark Palko's rant about teaching "the Law of Large Numbers" in the new "Common Core" curriculum for New York schools. Mark's conclusion:

If we start talking about setting aside significant time to cover probability and statistics accurately and in reasonable depth and put the ideas in proper context, you have my enthusiastic support, but until then maybe we should focus on the understanding, mastery, retention of the stuff that's already in the curriculum.

The Law of Large Numbers isn't mastered even by adults, college graduates, or PhDs, so it's hard to imagine how we could do it justice in high school. For one thing, there is mass confusion about what the law actually says, as I have written about here.
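For what it's worth, the theorem itself is easy to demonstrate by simulation; what people fail to master is what it does and does not imply. A quick sketch:

```python
# The Law of Large Numbers by simulation: the running average of fair coin
# flips converges to 0.5. It does NOT say that a streak of heads makes tails
# "due" - that is the gambler's fallacy.
import numpy as np

rng = np.random.default_rng(42)
flips = rng.integers(0, 2, size=100_000)  # 0 = tails, 1 = heads
running_mean = np.cumsum(flips) / np.arange(1, flips.size + 1)

for n in (10, 100, 1_000, 100_000):
    print(f"after {n:>7,} flips: running mean = {running_mean[n - 1]:.4f}")
```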

***

On the sister blog Junk Charts, I mentioned the article I wrote for Imagine magazine, which is targeted at high school students. Hopefully, a few kids decide to take up statistics after reading my little contribution (link to PDF).

***

There are lots of bad things happening in education. To start with, the pay-for-performance concept imported from the business world is singularly inappropriate when no one has been able to properly measure "performance". Even in the corporate world, I don't think I have come across a study showing that CEOs, corporate board members, or senior executives receive pay commensurate with their level of performance.