M4-V3 Causality

Welcome to this course on Data Analytics for Lean Six Sigma.
In this course you will learn data analytics techniques that are typically useful within Lean Six Sigma improvement projects. At the end of this course you will be able to analyse and interpret data gathered within such a project. You will be able to use Minitab to analyse the data. I will also briefly explain what Lean Six Sigma is.
I will emphasize the use of data analytics tools and the interpretation of their outcomes. I will use many different examples from actual Lean Six Sigma projects to illustrate all tools. I will not discuss any mathematical background.
The setting we chose for our data examples is a Lean Six Sigma improvement project. However, data analytics tools are very widely applicable, so you will learn techniques that you can also use in a broader setting beyond improvement projects.
I hope that you enjoy this course and good luck!
Dr. Inez Zwetsloot & the IBIS UvA team

Instructor

Inez Zwetsloot

Transcript

Correlation does not imply causation. It's a statement that statistics teachers really like to repeat. They really want you to remember this, and I am no exception. But why do you need to remember this? Let me explain. Correlation means that two things move together: when one changes, the other tends to change as well. Causation, on the other hand, means that one thing causes the other to change. I will show you various examples to explain what causality means and why it is different from correlation, and to teach you to interpret your results correctly.
So what does causality mean? It's about cause-and-effect relationships. This means that when you change your X variable and nothing else, your Y variable will change as a result. This is often very straightforward. For example, as children grow, they also become heavier; hence growing, or becoming taller, causes their weight to go up. We use an arrow to show this causal relationship. However, sometimes the causal relationship is a little less straightforward. To illustrate causality and what can go wrong, let me show you four examples where correlation is often mixed up with causality: the wrong direction of causality, the time problem, the third variable problem, and the underlying variable problem.
Let's start with the wrong direction. Imagine you want to test how the extraction time for decaffeinated coffee can be reduced, because shorter production processes are more profitable. You measure the extraction time and also the caffeine percentage of the coffee. You make a graph of extraction time versus caffeine percentage, and this is the scatter plot. It shows a clear relationship, and you conclude the following: the caffeine percentage influences my extraction time. To have a low extraction time, you need coffee beans with a lot of caffeine. However, if you think about this logically, this conclusion is of course wrong.
The correct conclusion is actually that the extraction time influences the caffeine percentage in your coffee. You have basically turned around the direction of causality. You have assumed that the X variable causes Y, when in fact it is the Y variable that is causing X. You always have to be careful when drawing conclusions, since statistics alone cannot tell you anything about the direction of your causal relationship.
Another possible problem with causality is the time problem. To illustrate this, I have collected data on two phenomena: the market share of Internet Explorer and the annual number of murders in the United States per 100,000 inhabitants. This is my data, and I made a graph of these two variables. Do you see the relationship? As Internet Explorer's market share goes up, the number of murders also goes up. A rather rash conclusion would be that, in order to reduce the number of murders in the US, new laws should be developed that discourage the use of Internet Explorer as a web browser. Underlying this conclusion is the wrong causal relationship, namely that the market share of Internet Explorer is causing the number of murders in the US. Obviously, this is not true.
So what is the actual explanation here? The data is real, and it is annual data starting in 2006. Let's have a look at the two variables separately in a time graph. From 2006 on, we see that Internet Explorer has lost market share, probably due to competition from browsers like Firefox and Google Chrome. Meanwhile, the situation in the United States has improved in the sense that the number of murders has decreased. So in fact Internet Explorer's market share, your X variable, and the number of murders, your Y variable, are not related at all. They just both decrease over time due to different causes.
Let's have a look at another possible problem with causality. This is the third variable problem.
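As a side note on the time problem: the effect is easy to reproduce with made-up numbers. The sketch below uses Python with NumPy (an assumption; the course itself uses Minitab) and does not use the real Internet Explorer or murder data. It generates two hypothetical series that both decline over time for unrelated reasons, then compares the correlation of the raw values with the correlation of the period-to-period changes. The function name `trend_demo`, the slopes, and the noise levels are all invented for illustration.

```python
import numpy as np


def trend_demo(n_periods=200, seed=0):
    """Correlate two unrelated, synthetic series that share a downward trend."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_periods)

    # Hypothetical data: both series fall over time, each with its own
    # independent noise. Neither series is computed from the other.
    x = 60.0 - 0.2 * t + rng.normal(0.0, 1.0, n_periods)   # e.g. a market share
    y = 6.0 - 0.01 * t + rng.normal(0.0, 0.1, n_periods)   # e.g. a yearly rate

    # Correlation of the raw values (dominated by the shared trend).
    level_corr = np.corrcoef(x, y)[0, 1]

    # Correlation of the changes from one period to the next.
    # Differencing removes the linear trend and leaves mostly the noise.
    change_corr = np.corrcoef(np.diff(x), np.diff(y))[0, 1]
    return level_corr, change_corr


level_corr, change_corr = trend_demo()
print(f"correlation of levels:  {level_corr:.2f}")
print(f"correlation of changes: {change_corr:.2f}")
```

With these settings the raw series should be strongly correlated, while the changes from one period to the next should show almost no correlation. That is what you would expect if the shared downward trend is the only thing the two series have in common. Comparing changes instead of levels is one simple check, not a proof of anything.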
Imagine you work at an insurance company and you want to find out whether the damage claims due to fires can be decreased. You find a potential influence factor: the number of firemen sent. What relationship do you expect? Probably that more firemen means lower damages. You collect data and make a graph. Surprisingly, the effect you find is in the opposite direction from what you would expect. The more firemen you deploy, the bigger the damage. What is going on here? Obviously, this is not because more firemen cause more damage. There is a third variable that influences both factors, namely the size of the fire. If the fire is bigger, more firemen are sent, but there will also be more damage, causing the damage and the number of firemen to increase or decrease at the same time. You have mistakenly assumed that the number of firemen, your X variable, causes the damage size, your Y variable, while in fact there is a third variable, the size of the fire, let's call it Z, which is causing both X and Y to increase or decrease at the same time.
The fourth problem with causality that you may encounter is the underlying variable problem. Remember the caffeine example? You may remember that we found that the extraction machine is an influence factor for the caffeine percentage, and that machine three did not deliver coffee that conformed to the specification of 0.1% caffeine. Now you might conclude that machine three needs to be replaced. However, let's have a look at this second graph. It shows that extraction time varies across the extraction machines. It shows, therefore, that the extraction machine itself is not the problem. Actually, it is the setting of the machine, as machine three is set to take less time for the extraction of caffeine. Thus machine three is not performing worse than the other two; its extraction time setting is simply not correct.
So we have assumed that X is causing Y, that is, that the extraction machine is causing the caffeine percentage to be too high, while in reality there is an underlying factor that causes the behavior of the machines, and that variable is extraction time.
So how do you make sure that you are assuming your causality correctly? One option is to perform a so-called controlled experiment. Normally, in observational data, your X variable and all the other variables vary naturally. In a controlled experiment, you manipulate your X variable yourself, keep all other factors constant, and then look at what happens to your Y. In the example of the fire damage, you would send a random number of firemen to fires that are all of equal size, and then study the damage. However, especially in the social sciences, it is not always possible or ethical to change your X variable and keep everything else constant. Therefore, we need different ways to detect correct causal relationships. One option is to study the literature that argues why a certain relationship is, or is not, causal. Or you can use logical thinking and time order: which variable comes first, and which variable follows?
In summary, we discussed four different types of errors you can make when you assume causality. You can assume a causal relationship in the wrong direction. You can relate two variables to each other that are actually unrelated but change in the same way over time. You can have a third variable that causes two variables to change at the same time. Or you can have an underlying variable that is actually the cause of the changes in Y instead of your initial X variable. You can avoid making these errors by performing a controlled experiment or by using additional literature that explains the cause-and-effect relationship. Logical thinking might also help to avoid making these errors. For example, with the time order problem, you can always ask yourself: what came first?
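To make the difference between observational data and a controlled experiment concrete, here is a small simulation of the firemen example. It is only a sketch with invented numbers, written in Python with NumPy rather than Minitab. It assumes, purely for illustration, that each extra fireman reduces the damage a little, that bigger fires get more firemen, and that bigger fires cause more damage. The function name `firemen_demo` and all coefficients are hypothetical.

```python
import numpy as np


def firemen_demo(n_fires=1000, seed=1):
    """Compare observational data with a simulated controlled experiment."""
    rng = np.random.default_rng(seed)

    # Observational data: the fire size Z varies naturally and drives both
    # the number of firemen sent (X) and the damage (Y).
    fire_size = rng.uniform(2.0, 10.0, size=n_fires)                   # Z
    firemen_obs = 2.0 * fire_size + rng.normal(0.0, 1.0, size=n_fires)  # X
    # Assumed true effect: every extra fireman lowers the damage by 1 unit.
    damage_obs = (10.0 * fire_size - 1.0 * firemen_obs
                  + rng.normal(0.0, 2.0, size=n_fires))                 # Y

    # Controlled experiment: all fires have the same size, and the number
    # of firemen is assigned at random instead of depending on the fire.
    fixed_size = 5.0
    firemen_exp = rng.integers(1, 21, size=n_fires).astype(float)
    damage_exp = (10.0 * fixed_size - 1.0 * firemen_exp
                  + rng.normal(0.0, 2.0, size=n_fires))

    obs_corr = np.corrcoef(firemen_obs, damage_obs)[0, 1]
    exp_corr = np.corrcoef(firemen_exp, damage_exp)[0, 1]
    return obs_corr, exp_corr


obs_corr, exp_corr = firemen_demo()
print(f"observational correlation (firemen vs damage): {obs_corr:.2f}")
print(f"experimental correlation  (firemen vs damage): {exp_corr:.2f}")
```

Under these assumptions the observational correlation comes out positive, because fire size pushes firemen and damage up together, while the experimental correlation comes out negative, which reflects the effect of firemen that was built into the simulation. The same data-generating rule for damage is used in both cases; only the way X is chosen differs.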
So, a final warning to everybody doing statistical analysis: always be careful when interpreting causal relationships.