Tag Archives: Statistics

There’s no question that “data science” is becoming more and more popular. In fact, Booz Allen Hamilton (a consultancy) found:

The term Data Science appeared in the computer science literature throughout the 1960s-1980s. It was not until the late 1990s, however, that the field as we describe it here, began to emerge from the statistics and data mining communities. Data Science was first introduced as an independent discipline in 2001. Since that time, there have been countless articles advancing the discipline, culminating with Data Scientist being declared the sexiest job of the 21st century.

Unsurprisingly, there are countless graduate and undergraduate programs in data science (Harvard, Berkeley, Waterloo, etc.), but what is data science, exactly?

Given that the field is still in its proverbial infancy, there are a number of different perspectives. Booz Allen offers the following in their Field Guide to Data Science from 2015: “Describing Data Science is like trying to describe a sunset — it should be easy, but somehow capturing the words is impossible.”

Pithiness aside, there does seem to be consensus around some of the pertinent themes contained within data science. For instance, a key component is usually “Big Data” (both unstructured and structured data). Dovetailing with Big Data, “statistics” is often cited as an important component. In particular, this means an understanding of the science of statistics (hypothesis-testing, etc.), including the ability to manipulate data and, almost always, the ability to turn that data into something that non-data scientists can understand (e.g. charts and graphs). The other big component is “programming.” Given the size of the datasets, Excel often isn’t the best option for interacting with the data. As a result, most data scientists need to have their programming skills up to snuff (oftentimes in more than one language).

What’s a Data Scientist?

Now that we know the three major components of data science are statistics, programming, and data visualization, do you think you could tell data scientists apart from statisticians, programmers, or data visualization experts? It’s a trick question: they’re all data scientists (broadly speaking).

A few years ago, O’Reilly Media conducted research on data scientists:

Why do people use the term “data scientist” to describe all of these professionals?

[…]

We think that terms like “data scientist,” “analytics,” and “big data” are the result of what one might call a “buzzword meat grinder.” The people doing this work used to come from more traditional and established fields: statistics, machine learning, databases, operations research, business intelligence, social or physical sciences, and more. All of those professions have clear expectations about what a practitioner is able to do (and not do), substantial communities, and well-defined educational and career paths, including specializations based on the intersection of available skill sets and market needs. This is not yet true of the new buzzwords. Instead, ambiguity reigns, leading to impaired communication (Grice, 1975) and failures to efficiently match talent to projects.

So… the ambiguity in understanding the meaning of data science stems from a failure to communicate? Classic movie references aside, the research from O’Reilly identified four main “clusters” of data scientists (and roles within said “clusters”):

Within these clusters fits some of the components described earlier, including two additional components: math/operations research (including things like algorithms and simulations) and business (including things like product development, management, and budgeting). The graphic below demonstrates the t-shaped-nature of data scientists — they have depth of expertise in one area and knowledge of other closely related areas. NOTE: ML is an acronym for machine learning.

Sometime last year, I came across a speech from the 2015 Toastmasters World Champion, Mohammed Qahtani. If you have a few minutes, I really suggest you take the time to watch it. OK, let’s say you only have a couple of minutes: just watch the introduction.

**SPOILERS BELOW**

While I’m not a fan of Qahtani’s parenting style (either option), I’m going to skip over that for now, as it’s not the main reason for writing this post. I’m also going to skip over the stereotypical portrayal of scientists, again, as it’s not the main reason for writing this post (but I will say that I’ve never met a scientist who conforms to that ‘stereotypical portrayal’). The main reason for writing this post is the first few minutes of the video: the startling anecdote that Qahtani shares about smoking and diabetes. Be honest: did you believe him when he said, “the amount of people dying from diabetes is three times as many dying from smoking?” Based on the audience’s response, I suspect that there are probably at least some of you who didn’t know this. To be clear, it’s not my aim to make you feel bad about this. If this isn’t a piece of data you’ve been exposed to at some point in your life, you probably have little reason to know. (Unfortunately, smoking is part of my family history, so I knew Qahtani was up to something when I heard him make that statement. Oh, and if you’re curious, WHO posits that smoking is the leading cause of death where 1 in 10 adults worldwide [!] die as a result of it, whereas diabetes is ‘only’ the 7th leading cause of death in the US.)

Circling back to the video… conviction. Did you notice the conviction with which Qahtani parroted the statistics about diabetes and smoking? He said it so assuredly that it almost makes you want to believe him (or at a minimum, question whether what you thought you knew about those two pieces of statistics was true or not). When I saw him do this, it reminded me of the hundreds of articles you see published each year that advise people on how to sell themselves or their company. The infamous elevator pitch.

Invariably, when you read articles (or books!) about how to give a good elevator pitch, you’re going to find that one of the most important things you can do in that elevator pitch is to be confident (or passionate or some other synonym that fits nicely into the author’s acronym). Don’t get me wrong, confidence is certainly important when it comes to making your elevator pitch, but seeing Qahtani express himself with an air of confidence made me wonder about human fallibility with regard to elevator pitches.

Sure, I suspect that people whose job it is to listen to elevator pitches on a constant basis will tell you that they have a finely tuned BS-detector, but what about the rest of us who haven’t spent 10,000 hours listening to elevator pitches? I bet you’re thinking that you don’t have to worry about that when it comes to your field because you’re an expert. OK. Let’s accept for a moment that you are. What about all the other fields in which you haven’t achieved “expert” status? What do you do there? Well, I suppose you/we could perfect y/our BS-detector, but I suppose there’s still the possibility that you might make a type I/II error (depending upon your perspective). That is, there’s still the possibility that you might miss the BS for what it is and it’s also possible that you might incorrectly assess something as BS when it’s actually gold!

On that note, I want to leave you with the powerful words of Dr. Maya Angelou, on words:

I’ve used the subtitle in a previous post and I think the application to the content of this post also makes it worthwhile to use again. I was reading a post from Tim Ferriss the other day and it made me think of statistics. The post is about alternative medicine, but understanding that isn’t entirely necessary for the point I’m making. Here’s some context:

Imagine you catch a cold or get the flu. It’s going to get worse and worse, then better and better until you are back to normal. The severity of symptoms, as is true with many injuries, will probably look something like a bell curve.

The bottom flat line, representing normalcy, is the mean. When are you most likely to try the quackiest shit you can get your hands on? That miracle duck extract Aunt Susie swears by? The crystals your roommate uses to open his heart chakra? Naturally, when your symptoms are the worst and nothing seems to help. This is the very top of the bell curve, at the peak of the roller coaster before you head back down. Naturally heading back down is regression toward the mean.

If you are a fallible human, as we all are, you might misattribute getting better to the duck extract, but it was just coincidental timing.

The body had healed itself, as could be predicted from the bell curve–like timeline of symptoms. Mistaking correlation for causation is very common, even among smart people.

And the important part of the quote [Emphasis Added]:

In the world of “big data,” this mistake will become even more common, particularly if researchers seek to “let the data speak for themselves” rather than test hypotheses.

Spurious connections galore–that’s what the data will say, among other things. Caveat emptor.

This analogy reminded me of the first time I learned about correlation and causation in my first psychology class as an undergraduate. It had to do with ice cream, hot summer days, and swimming pools. In fact, here’s a quick summary from wiki:

An example of a spurious relationship can be illuminated by examining a city’s ice cream sales. These sales are highest when the rate of drownings in city swimming pools is highest. To allege that ice cream sales cause drowning, or vice-versa, would be to imply a spurious relationship between the two. In reality, a heat wave may have caused both. The heat wave is an example of a hidden or unseen variable, also known as a confounding variable.

Getting back to what Ferriss was saying near the end of his quote: as “Big Data” grows in popularity (and use), there may be an increased likelihood of making errors in the form of spurious relationships. One way to mitigate this error is education. That is, if the people who are handling Big Data know and understand things like correlation vs. causation and spurious relationships, these errors may be less likely to occur.

I suppose it’s also possible that some, knowing about these kinds of errors and how little the average person might know when it comes to statistics, could deliberately report misleading statistics. I’d like to think that people aren’t doing this and it just has more to do with confirmation bias.

Regardless, one way to guard against this inaccurate reporting would be to use hypotheses. That is, before you look at the data, make a prediction about what you’ll find in the data. It’s certainly not going to solve all the issues, but it’ll go a long way towards doing so.

Earlier this morning, I came across a headline that was a bit shocking (to me): “Americans Support the Keystone XL Pipeline by Wide Margin.” All of the data I’d seen regarding polls of Americans showed that there certainly wasn’t a wide margin in support or against the pipeline. So, with my curiosity piqued, I clicked the article to find out that 67% (of the survey respondents) support building the pipeline. That still seemed a bit surprising, as, like I said, most polls I’d seen had stayed in the range of 45/55 or 55/45.

Upon getting to the actual survey, I scrolled to the question that led to the headline. Here’s the question that was read to survey respondents:

The President is deciding whether to build the Keystone X-L Pipeline to carry oil from Canada to the United States. Supporters of the pipeline say it will ease America’s dependence on Mideast oil and create jobs. Opponents fear the environmental impact of building a pipeline. What about you – do you support or oppose building the Keystone X-L pipeline?

Do you see anything wrong with this question?

Let’s start with the idea that they’re telling respondents what supporters say and opponents say. If the respondent doesn’t really have a strong opinion about the question, they may prefer to identify with one group or the other (and they might even if they have a strong opinion!). One could argue that there’s a response bias present. There has been quite a bit of press about “America’s dependence on foreign oil,” so someone might not want to oppose that viewpoint in a survey. That is, the respondent wouldn’t want to appear, to the person conducting the survey, to think that reducing America’s dependence on foreign oil is less important than the environment.

Juxtaposing the dependence on foreign oil with environmental impact is a bit unfair. As I said in the previous paragraph, I’d bet that most people have heard/read something about America’s dependence on foreign oil, but they probably don’t know very much about the environmental impact of oil. Now, that could be a messaging problem for the environmental movement, but there hasn’t been a compelling enough case made. (If there were, there certainly wouldn’t have been this many people who were “A-OK” with building the pipeline.)

The United States relied on net imports (imports minus exports) for about 40% of the petroleum (crude oil and petroleum products) that we consumed in 2012. Just over half of these imports came from the Western Hemisphere. Our dependence on foreign petroleum has declined since peaking in 2005. [Emphasis added]

Doing the math, 60% of the petroleum (oil) that the US consumed in 2012 was produced domestically, inside the US! We’re also told that just over half of the imports came from the Western Hemisphere, meaning less than half of the imports came from countries outside of the Western Hemisphere. So, less than half of the imports could be coming from the Mideast, and we already know that only 40% of the oil consumed in the US comes from imports. In fact, this same agency tells us just how much oil is imported from Persian Gulf countries: 29% of imports. So, the US’s reliance on Mideast oil is 29% of the 40% that is imported. Doing the math again, Mideast oil makes up 11.6% of total US consumption. Does 11.6% sound like dependence?
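For anyone who wants to check the arithmetic, here is a quick sketch in Python. The two input figures are the ones quoted in this post; the variable names are my own, and, like the paragraph above, the sketch assumes the 29% figure can be applied directly to the 40% net-import share.

```python
# Figures quoted in the post (2012).
net_import_share = 0.40               # share of US consumption met by net imports
persian_gulf_share_of_imports = 0.29  # share of imports from Persian Gulf countries

# Whatever is not imported was produced domestically.
domestic_share = 1 - net_import_share

# A share of a share: multiply the two proportions.
persian_gulf_share_of_consumption = net_import_share * persian_gulf_share_of_imports

print(f"Produced domestically: {domestic_share:.1%}")
print(f"Persian Gulf oil as a share of US consumption: {persian_gulf_share_of_consumption:.1%}")
```

The key step is multiplying the two proportions: a percentage of imports only becomes a percentage of consumption once it is scaled by the import share.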

And recall the last line of the quote from the agency: “Our dependence on foreign petroleum has declined since peaking in 2005.”

The next time you read survey data, I hope you’ll remember this post and consider just how skewed the results may be.

[Note: The title of this post is a quote that was popularized by Mark Twain.]

I know that I said that I was going to be talking about a new bias in judgment and decision-making every Monday and I know that today is Tuesday. To be honest, I underestimated how long it would take me to prepare for my seminar in International Relations. Aside: if you want to challenge yourself, take a course in a subject which you know very little about and be amazed at how much you feel like you’ve been dropped into the ocean and told to swim! It can be a little unnerving at first, but if you’re into exploring and open to new experiences, it can be quite satisfying. Anyway, yesterday I’d planned to talk about the framing effect, but since I so conveniently demonstrated the planning fallacy, I thought I’d talk about it.

The consequence of this post being written/published today is directly related to my falling into the trap of the planning fallacy. I planned for the preparation for my International Relations class to take a certain amount of time. When the preparation took longer than I had anticipated, I had no time left to write about a bias in judgment and decision-making. The planning fallacy is our tendency to underestimate how long we’ll need to complete a task, even when we’ve had experiences where we’ve underestimated similar tasks.

This is something that even the best of us fall prey to. In fact, one of the biggest names in cognitive biases, Daniel Kahneman (Nobel Prize in economics, but a PhD in psychology!), has said that even he still has a hard time with the planning fallacy. Of course, this doesn’t make it permissible for us not to try to prevent the effects of the planning fallacy.

Before we get into ways for avoiding the planning fallacy, I want to share an excerpt from a study that is often cited in discussions of the planning fallacy [emphasis added]:

Participants were provided with a series of specific confidence levels and were asked to indicate the completion time corresponding to each confidence level. In this manner, the participants indicated times by which they were 50% certain they would finish their projects (and 50% certain they would not), 75% certain they would finish, and 99% certain they would finish. When we examined the proportion of subjects who finished by each of these forecasted times, we found evidence of overconfidence. Consider the academic projects: only 12.8% of the subjects finished their academic projects by the time they reported as their 50% probability level, only 19.2% finished by the time of their 75% probability level, and only 44.7% finished by the time of their 99% probability level. The results for the 99% probability level are especially striking: even when they make a highly conservative forecast, a prediction that they feel virtually certain that they will fulfill, people’s confidence far exceeds their accomplishments.

There were a lot of numbers/percentages offered in the excerpt, so I’ve also included a visual representation of the data in a graph below. This graph comes from a book chapter by a couple of the same authors, but it is about the data in the preceding excerpt.
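The three pairs of numbers from the excerpt can also be lined up in a few lines of Python. This is only a sketch of the comparison; the percentages are the ones quoted in the excerpt, and the variable names are mine.

```python
# (stated confidence, share of subjects who actually finished by that time),
# taken from the academic-projects figures in the excerpt above.
calibration = [(0.50, 0.128), (0.75, 0.192), (0.99, 0.447)]

# Overconfidence gap: how far actual completion fell short of stated confidence.
gaps = [stated - actual for stated, actual in calibration]

for (stated, actual), gap in zip(calibration, gaps):
    print(f"stated {stated:.0%}  finished {actual:.1%}  gap {gap * 100:.1f} points")
```

If people were well calibrated, every gap would be close to zero. Instead, the gaps work out to roughly 37, 56, and 54 percentage points.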

Ways for Avoiding the Planning Fallacy

With the first three biases I talked about, awareness was a key step in overcoming the bias. While you could make that argument for the planning fallacy, one of the hallmarks of [the fallacy] is that people know they’ve erred in the past and still make the mistake of underestimating. So, we’ll need to move beyond awareness to help us defend against this bias.

1) Data is your friend

No, I don’t mean Data from Star Trek (though Data would probably be quite helpful in planning), but now that I think about it, Data (the character) might be a good way to position this ‘way for avoiding the planning fallacy.’ For those of you not familiar, Data is a human-like android. In thinking about this way for avoiding the planning fallacy, think about how Data might estimate the length of time it would take to complete a project. It would be very precise and data-driven. Data would likely look at past projects and how long it took for those to be finished to decide the length of time needed for this new project. To put it more broadly, if you have statistics on past projects (that were similar) absolutely use them in estimating the completion time of the new project.

2) Get a second opinion

When we think about the project completion time of one project in relation to another project, we often think about the nuances that make this project different from that project — and by extension — why this project won’t take as long as that project. Planning fallacy. If you can, ask someone who has experience in project completion in the area for which you’re estimating. When you ask this person, be sure not to tell them all the “various ways why this project is different,” because it probably isn’t and it’s only going to cloud the predictive ability of the person you’re asking. You’re probably going to hear an estimate that’s larger than you thought, but I bet you that it’s probably a lot closer to the real project completion time than the estimate you made based on thinking about the ways that this project was going to be different than all the other projects like it.
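To make the first tip (“Data is your friend”) a little more concrete, here is a minimal sketch in Python of how past projects could be used to correct a new estimate. The project history is entirely hypothetical, and scaling by the median overrun is just one simple choice I’m assuming for illustration, not a method from the study quoted earlier.

```python
import statistics

# HYPOTHETICAL history: (estimated days, actual days) for similar past projects.
past_projects = [(10, 14), (5, 9), (20, 26), (8, 12)]

# How much longer each project ran than planned, as a ratio of actual to estimated.
overrun_ratios = [actual / estimated for estimated, actual in past_projects]

# The median is less sensitive than the mean to one disastrous project.
typical_overrun = statistics.median(overrun_ratios)


def adjusted_estimate(gut_estimate_days: float) -> float:
    """Scale a gut estimate by the typical overrun seen in past projects."""
    return gut_estimate_days * typical_overrun


print(f"typical overrun: {typical_overrun:.2f}x")
print(f"gut estimate of 7 days -> plan for about {adjusted_estimate(7):.1f} days")
```

The point is that the adjustment comes from how similar projects actually went, not from thinking harder about why this one is different.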

If you liked this post, you might like the first three posts in this series:

In preparing for the classes that I teach on Tuesday, I was re-reading the assigned chapters in the textbook yesterday. This week, we’re covering cross-cultural management. A few pages into the chapter, I was dismayed to read the following:

“Here are a couple of positive signs: 2008 saw record numbers of foreign students (623,805) studying in the United States and US students (241,791) studying abroad.”

Does anyone know what’s wrong with this? After reading this paragraph, I took to Twitter to respond. Let’s go to the tweets!

If someone offers absolute numbers to support their claim, be skeptical.

To better contextualize the numbers offered in the textbook, the author would need to offer some numbers on foreign students studying in the US, and likewise US students studying abroad, in recent years. That is, are the numbers trending up? Downward? Was this year an anomaly?

Even more important than the numbers from earlier years would be to fully contextualize them by offering percentages. Is the percentage of foreign-born students studying in the US higher than it was last year? What about for US students studying abroad?

Simply offering these absolute values is, in a sense, misleading. It conveys to the reader that foreign study is trending up, when in fact, it could be on the decline. By having more students studying (in general) there is a higher number of students who could study abroad. And that’s why it’s important to have percentages (in this case). In some cases, percentages won’t be helpful. It really all depends on the question you’re trying to answer or the information you’re trying to convey.
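Here is a small sketch in Python of how a record absolute number can coexist with a falling percentage. Only the 2008 count of foreign students (623,805) comes from the textbook quote; the 2007 count and both total-enrollment figures are HYPOTHETICAL numbers I made up to illustrate the point.

```python
def share(part: int, whole: int) -> float:
    """Return part as a proportion of whole."""
    return part / whole


# The 2008 count is from the textbook quote; every other number is HYPOTHETICAL.
foreign_students = {"2007": 600_000, "2008": 623_805}
total_enrollment = {"2007": 17_000_000, "2008": 18_500_000}

for year in ("2007", "2008"):
    pct = share(foreign_students[year], total_enrollment[year])
    print(f"{year}: {foreign_students[year]:,} students = {pct:.2%} of enrollment")
```

With these made-up totals, the count of foreign students goes up while their share of overall enrollment goes down, which is exactly the situation a lone “record number” would hide.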